Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2011/01/13 22:20:15 UTC

[Solr Wiki] Trivial Update of "Suggester" by Juan Grande


The "Suggester" page has been changed by Juan Grande.
The comment on this change is: Replaced all occurrences of "location" by "sourceLocation". Fixed the "Search handler configuration" section's bulleting.
http://wiki.apache.org/solr/Suggester?action=diff&rev1=6&rev2=7

--------------------------------------------------

  = Suggester - a flexible "autocomplete" component. =
- 
  A common need in search applications is suggesting query terms or phrases based on incomplete user input. These completions may come from a dictionary based on the main index or on any other arbitrary dictionary. It's often useful to provide only the top-N suggestions, ranked either alphabetically or by their usefulness to an average user (e.g. popularity, or the number of returned results).
  
  Solr 3.x and 4.x include a component called Suggester that provides this functionality. See the [[https://issues.apache.org/jira/browse/SOLR-1316|SOLR-1316]] JIRA issue for the original motivations and patches.
  
  Suggester reuses much of the SpellCheckComponent infrastructure, so it also reuses many common SpellCheck parameters, such as `spellcheck=true` and `spellcheck.build=true`. The way this component is configured in `solrconfig.xml` is also very similar:
+ 
  {{{
    <searchComponent class="solr.SpellCheckComponent" name="suggest">
      <lst name="spellchecker">
@@ -34, +34 @@

      </arr>
    </requestHandler>
  }}}
- 
  The look-up of matching suggestions in a dictionary is implemented by subclasses of the Lookup class. Two implementations are included in Solr, both based on in-memory tries: JaspellLookup and TSTLookup. Benchmarks indicate that TSTLookup provides better performance at a lower memory cost (roughly 50% faster and 50% of the memory cost). However, JaspellLookup can provide "fuzzy" suggestions, though this functionality is not currently exposed (it's a one-line change in JaspellLookup).
  
  An example of an autosuggest request:
+ 
  {{{
  http://localhost:8983/solr/suggest?q=ac
  }}}
+ And the corresponding response:
  
- And the corresponding response:
  {{{
  <?xml version="1.0" encoding="UTF-8"?>
  <response>
@@ -62, +62 @@

    </lst>
  </response>
  }}}
- 
  = Configuration =
  The configuration snippet above shows a few common configuration parameters. Here's a complete list of them; a combined example sketch follows the list:
  
  == SpellCheckComponent configuration ==
- 
  * `searchComponent/@name` - an arbitrary name for this component
  
  * `spellchecker` list:
+ 
-   * `name` - a symbolic name of this spellchecker (can be later referred to in URL parameters and in SearchHandler configuration - see the section below)
+  * `name` - a symbolic name of this spellchecker (can be later referred to in URL parameters and in SearchHandler configuration - see the section below)
-   * `classname` - Suggester, to provide the autocomplete functionality
+  * `classname` - Suggester, to provide the autocomplete functionality
-   * `lookupImpl` - Lookup implementation. Currently two in-memory implementations are available:
+  * `lookupImpl` - Lookup implementation. Currently two in-memory implementations are available:
-     * `org.apache.solr.suggest.tst.TSTLookup` - a simple compact ternary trie based lookup
+   * `org.apache.solr.suggest.tst.TSTLookup` - a simple compact ternary trie based lookup
-     * `org.apache.solr.suggest.jaspell.JaspellLookup` - a more complex lookup based on a ternary trie from the [[http://jaspell.sourceforge.net/|JaSpell]] project.
+   * `org.apache.solr.suggest.jaspell.JaspellLookup` - a more complex lookup based on a ternary trie from the [[http://jaspell.sourceforge.net/|JaSpell]] project.
-   * `buildOnCommit` - if set to true then the Lookup data structure will be rebuilt after commit. If false (default) then the Lookup data will be built only when requested (by URL parameter `spellcheck.build=true`). '''NOTE: currently implemented Lookup-s keep their data in memory, so unlike spellchecker data this data is discarded on core reload and not available until you invoke the build command, either explicitly or implicitly via commit.'''
+  * `buildOnCommit` - if set to true, the Lookup data structure will be rebuilt after each commit. If false (the default), the Lookup data will be built only when requested (by the URL parameter `spellcheck.build=true`). '''NOTE: the currently implemented Lookups keep their data in memory, so unlike spellchecker data, this data is discarded on core reload and is not available until you invoke the build command, either explicitly or implicitly via a commit.'''
-   * `location` - location of the dictionary file. If not empty then this is a path to a dictionary file (see below). If this value is empty then the main index will be used as a source of terms and weights.
+  * `sourceLocation` - location of the dictionary file. If not empty, this is interpreted as a path to a dictionary file (see below); if empty, the main index will be used as the source of terms and weights.
-   * `field` - if `location` is empty then terms from this field in the index will be used when building the trie.
+  * `field` - if `sourceLocation` is empty then terms from this field in the index will be used when building the trie.
-   * `threshold` - threshold is a value in [0..1] representing the minimum fraction of documents (of the total) where a term should appear, in order to be added to the lookup dictionary.
+  * `threshold` - a value in [0..1] representing the minimum fraction of documents (out of the total) in which a term must appear in order to be added to the lookup dictionary.
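
  Putting these parameters together, a single spellchecker entry might look like the following sketch. This is illustrative, not canonical: the fully qualified Suggester class name, the `name` field, and the 0.5% threshold are assumptions for the example.

  {{{
    <searchComponent class="solr.SpellCheckComponent" name="suggest">
      <lst name="spellchecker">
        <str name="name">suggest</str>
        <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
        <str name="lookupImpl">org.apache.solr.suggest.tst.TSTLookup</str>
        <!-- assumed example field; use any indexed field from your schema -->
        <str name="field">name</str>
        <!-- skip terms that appear in fewer than 0.5% of all documents -->
        <float name="threshold">0.005</float>
        <str name="buildOnCommit">true</str>
      </lst>
    </searchComponent>
  }}}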
  
  == Dictionary ==
- When a file-based dictionary is used (non-empty `location` parameter above) then it's expected to be a plain text file in UTF-8 encoding. Blank lines and lines that start with a '#' are ignored. The remaining lines must consist of either a string without literal TAB (\u0007) character, or a string and a TAB separated floating-point weight.
+ When a file-based dictionary is used (a non-empty `sourceLocation` parameter above), it's expected to be a plain text file in UTF-8 encoding. Blank lines and lines that start with a '#' are ignored. The remaining lines must consist of either a string without a literal TAB (\u0009) character, or a string followed by a TAB and a floating-point weight.
  
  Example:
+ 
  {{{
  # This is a sample dictionary file.
  
@@ -92, +92 @@

  accidentally\t2.0
  accommodate\t3.0
  }}}
- 
  If the weight is missing, it's assumed to be 1.0. Weights affect the sorting of matching suggestions when `spellcheck.onlyMorePopular=true` is selected - weights are treated as a "popularity" score, with higher-weighted suggestions preferred over lower-weighted ones.
  
  Please note that the format of the file is not limited to single terms; it can also contain phrases - an improvement over the TermsComponent, which you could otherwise use for a simple version of autocomplete functionality.
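
  For instance, a phrase-capable dictionary could look like this (hypothetical entries; \t stands for a literal TAB, as in the example above):

  {{{
  # phrases follow the same string<TAB>weight format as single terms
  ac adapter\t1.0
  acoustic guitar\t3.5
  }}}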
  
  === Threshold parameter ===
- As mentioned above, if the `location` parameter is empty then the terms from a field indicated by the `field` parameter are used. It's often the case that due to imperfect source data there are many uncommon or invalid terms that occur only once in the whole corpus (e.g. OCR errors, typos, etc). According to the Zipf's law this actually forms the majority of terms, which means that the dictionary built indiscriminately from a real-life index would consist mostly of uncommon terms, and its size would be enormous. In order to avoid this and to reduce the size of in-memory structures it's best to set the `threshold` parameter to a value slightly above zero (0.5% in the example above). This already vastly reduces the size of the dictionary by skipping [[http://en.wikipedia.org/wiki/Hapax_legomenon|"hapax legomena"]] while still preserving most of the common terms. This parameter has no effect when using a file-based dictionary - it's assumed that only useful terms are found there. ;)
+ As mentioned above, if the `sourceLocation` parameter is empty then the terms from the field indicated by the `field` parameter are used. It's often the case that due to imperfect source data there are many uncommon or invalid terms that occur only once in the whole corpus (e.g. OCR errors, typos, etc.). According to Zipf's law these actually form the majority of terms, which means that a dictionary built indiscriminately from a real-life index would consist mostly of uncommon terms, and its size would be enormous. In order to avoid this and to reduce the size of the in-memory structures, it's best to set the `threshold` parameter to a value slightly above zero (0.5% in the example above - e.g. in a 1,000,000-document index a term would then have to appear in at least 5,000 documents). This already vastly reduces the size of the dictionary by skipping [[http://en.wikipedia.org/wiki/Hapax_legomenon|"hapax legomena"]] while still preserving most of the common terms. This parameter has no effect when using a file-based dictionary - it's assumed that only useful terms are found there. ;)
  
  == SearchHandler configuration ==
  In the example above we add a new handler that uses SearchHandler with the single SearchComponent that we just defined, namely the `suggest` component. Then we define a few defaults for this component (these can be overridden with URL parameters; see the sketch after this list):
  
- * `spellcheck=true` - because we always want to run the Suggester for queries submitted to this handler.
+  * `spellcheck=true` - because we always want to run the Suggester for queries submitted to this handler.
- * `spellcheck.dictionary=suggest` - this is the name of the dictionary component that we configured above.
+  * `spellcheck.dictionary=suggest` - this is the name of the dictionary component that we configured above.
- * `spellcheck.onlyMorePopular=true` - if this parameter is set to true then the suggestions will be sorted by weight ("popularity") - the `count` parameter will effectively limit this to a top-N list of best suggestions. If this is set to false then suggestions are sorted alphabetically.
+  * `spellcheck.onlyMorePopular=true` - if this parameter is set to true, the suggestions will be sorted by weight ("popularity"), and the `count` parameter will effectively limit this to a top-N list of the best suggestions. If it is set to false, suggestions are sorted alphabetically.
- * `spellcheck.count=5` - specifies to return up to 5 suggestions.
+  * `spellcheck.count=5` - return up to 5 suggestions.
- * `spellcheck.collate=true` - to provide a query collated with the first matching suggestion.
+  * `spellcheck.collate=true` - to provide a query collated with the first matching suggestion.
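
  Taken together, these defaults suggest a handler definition roughly like the following sketch (a reconstruction of the elided snippet above; the `/suggest` path is taken from the earlier request example):

  {{{
    <requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
      <lst name="defaults">
        <str name="spellcheck">true</str>
        <str name="spellcheck.dictionary">suggest</str>
        <str name="spellcheck.onlyMorePopular">true</str>
        <str name="spellcheck.count">5</str>
        <str name="spellcheck.collate">true</str>
      </lst>
      <arr name="components">
        <str>suggest</str>
      </arr>
    </requestHandler>
  }}}
  Any of these defaults can then be overridden per request, e.g. `http://localhost:8983/solr/suggest?q=ac&spellcheck.count=10`.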
  
  = Tips and tricks =
- 
- * Use TSTLookup unless you need a more sophisticated matching from JaspellLookup. See [[https://issues.apache.org/jira/browse/SOLR-1316?focusedCommentId=12873599&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12873599|benchmark results]] - the source of this benchmark is in SuggesterTest.
+ * Use TSTLookup unless you need the more sophisticated matching of JaspellLookup. See [[https://issues.apache.org/jira/browse/SOLR-1316?focusedCommentId=12873599&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12873599|benchmark results]] - the source of this benchmark is in SuggesterTest.
  
  * Use the `threshold` parameter to limit the size of the trie, to reduce the build time, and to remove invalid/uncommon terms. Values below 0.01 should be sufficient; greater values can be used to limit the impact of terms that occur in a larger portion of documents. Values above 0.5 probably don't make much sense.
  
  * Don't forget to invoke `spellcheck.build=true` after a core reload, as shown below. Alternatively, extend the Lookup class to do this on init(), or implement the load/save methods in Lookup to persist the data across core reloads.
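
  For example, a build could be triggered with a request like this (reusing the example handler and host from above):

  {{{
  http://localhost:8983/solr/suggest?q=ac&spellcheck.build=true
  }}}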
  
  * If you want to use a dictionary file that contains phrases (actually, strings that can be split into multiple tokens by the default QueryConverter), then define a different QueryConverter, like this:
+ 
  {{{
    <!--
    The SpellingQueryConverter to convert raw (CommonParams.Q) queries into tokens.  Uses a simple regular expression