Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2011/04/14 13:11:01 UTC

[Solr Wiki] Update of "Suggester" by DawidWeiss

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "Suggester" page has been changed by DawidWeiss.
http://wiki.apache.org/solr/Suggester?action=diff&rev1=9&rev2=10

--------------------------------------------------

        <str name="name">suggest</str>
        <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
        <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
+       <!-- Alternatives to lookupImpl: 
+            org.apache.solr.spelling.suggest.fst.FSTLookup   [finite state automaton]
+            org.apache.solr.spelling.suggest.jaspell.JaspellLookup [default, jaspell-based]
+            org.apache.solr.spelling.suggest.tst.TSTLookup   [ternary trees]
+       -->
        <str name="field">name</str>  <!-- the indexed field to derive suggestions from -->
        <float name="threshold">0.005</float>
        <str name="buildOnCommit">true</str>
@@ -36, +41 @@

      </arr>
    </requestHandler>
  }}}
- The look-up of matching suggestions in a dictionary is implemented by subclasses of the Lookup class - there are two implementations that are included in Solr, both are based on in-memory tries: JaspellLookup and TSTLookup. Benchmarks indicate that TSTLookup provides better performance at a lower memory cost (roughly 50% faster and 50% of memory cost) - however, JaspellLookup can provide "fuzzy" suggestions, though this functionality is not currently exposed (it's a one line change in JaspellLookup).
+ The look-up of matching suggestions in a dictionary is implemented by subclasses of the Lookup class - there are three implementations that are included in Solr:
+  * JaspellLookup - a tree-based representation based on Jaspell,
+  * TSTLookup - a ternary tree-based representation, capable of immediate data structure updates,
+  * FSTLookup - an automaton-based representation; slower to build, but consumes far less memory at runtime (see the performance notes below).
+ 
+ For practical purposes, all of the above implementations will most likely run at similar speed when requests are made via the HTTP stack (which then becomes the bottleneck). Direct benchmarks of these classes indicate that FSTLookup provides better performance than the other two methods, at a much lower memory cost. JaspellLookup can provide "fuzzy" suggestions, though this functionality is not currently exposed (it's a one-line change in JaspellLookup). Support for infix suggestions is planned for FSTLookup (which would then be the only structure to support them).
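+ 
+ For example, switching to the FST-based implementation only requires changing the `lookupImpl` value. A minimal sketch, mirroring the snippet above (the surrounding `searchComponent` wrapper shown here follows the standard SpellCheckComponent setup and is an assumption - adjust the names and the field to your schema):
+ {{{
+ <searchComponent class="solr.SpellCheckComponent" name="suggest">
+   <lst name="spellchecker">
+     <str name="name">suggest</str>
+     <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
+     <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup</str>
+     <str name="field">name</str>  <!-- the indexed field to derive suggestions from -->
+     <float name="threshold">0.005</float>
+     <str name="buildOnCommit">true</str>
+   </lst>
+ </searchComponent>
+ }}}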
  
  An example of an autosuggest request:
  
@@ -65, +76 @@

  </response>
  }}}
  = Configuration =
- The configuration snippet above shows a few common configuration parameters. Here's a complete list of them:
+ The configuration snippet above shows a few common configuration parameters. A complete list of them is best found in the source code of each Lookup class, but here is an overview:
  
  == SpellCheckComponent configuration ==
  * `searchComponent/@name` - arbitrary name for this component
@@ -77, +88 @@

 * `lookupImpl` - Lookup implementation. Currently three in-memory implementations are available:
  * `org.apache.solr.spelling.suggest.tst.TSTLookup` - a simple, compact ternary trie-based lookup
  * `org.apache.solr.spelling.suggest.jaspell.JaspellLookup` - a more complex lookup based on a ternary trie from the [[http://jaspell.sourceforge.net/|JaSpell]] project.
+   * `org.apache.solr.spelling.suggest.fst.FSTLookup` - an automaton-based lookup
 * `buildOnCommit` - if set to true then the Lookup data structure will be rebuilt after each commit. If false (the default), the Lookup data will be built only when requested (by the URL parameter `spellcheck.build=true`). '''NOTE: the currently implemented Lookup classes keep their data in memory, so unlike spellchecker data it is discarded on core reload and not available until you invoke the build command, either explicitly or implicitly via a commit.'''
   * `sourceLocation` - location of the dictionary file. If not empty then this is a path to a dictionary file (see below). If this value is empty then the main index will be used as a source of terms and weights.
   * `field` - if `sourceLocation` is empty then terms from this field in the index will be used when building the trie.
@@ -96, +108 @@

  }}}
  If the weight is missing, it is assumed to be 1.0. Weights affect the sorting of matching suggestions when `spellcheck.onlyMorePopular=true` is selected - weights are treated as a "popularity" score, with higher weights preferred over suggestions with lower weights.
  
- Please note that the format of the file is not limited to single terms but can also contain phrases - which is an improvement over the TermsComponent that you could also use for a simple version of autocomplete functionality.
+ Please note that the format of the file is not limited to single terms - it can also contain phrases, which is an improvement over the TermsComponent (which could otherwise be used for a simple version of autocomplete functionality).
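+ 
+ For illustration, a file-based dictionary mixing single terms and phrases could look like the following sketch (assuming the per-line term/weight layout described above, with the optional weight separated from the entry by a tab character; the entries themselves are made up):
+ {{{
+ acquire
+ accommodate	3.0
+ accidentally in love	2.0
+ manhattan transfer	5.5
+ }}}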
+ 
+ FSTLookup has a built-in mechanism to discretize weights into a fixed set of buckets (to speed up suggestions). The number of buckets is configurable.
  
  === Threshold parameter ===
  As mentioned above, if the `sourceLocation` parameter is empty then the terms from the field indicated by the `field` parameter are used. It's often the case that, due to imperfect source data, there are many uncommon or invalid terms that occur only once in the whole corpus (e.g. OCR errors, typos, etc.). According to Zipf's law these actually form the majority of terms, which means that a dictionary built indiscriminately from a real-life index would consist mostly of uncommon terms, and its size would be enormous. In order to avoid this, and to reduce the size of the in-memory structures, it's best to set the `threshold` parameter to a value slightly above zero (0.5% in the example above). This already vastly reduces the size of the dictionary by skipping [[http://en.wikipedia.org/wiki/Hapax_legomenon|"hapax legomena"]] while still preserving most of the common terms. This parameter has no effect when using a file-based dictionary - it's assumed that only useful terms are found there. ;)
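  
  To make the example value concrete (threshold is expressed as a fraction of the total document count), repeating the setting from the snippet at the top of this page:
  {{{
  <!-- a term must appear in at least 0.5% of all documents to enter the dictionary;
       e.g. in an index of 1,000,000 documents that means at least 5,000 documents -->
  <float name="threshold">0.005</float>
  }}}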
@@ -111, +125 @@

   * `spellcheck.collate=true` - to provide a query collated with the first matching suggestion.
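  
  For illustration, a single request that builds the dictionary and asks for collated suggestions could look like this (here `ac` is the prefix to complete; the host, port and the `/suggest` handler path are assumptions - use whatever request handler the component above is registered with):
  {{{
  http://localhost:8983/solr/suggest?q=ac&spellcheck=true&spellcheck.dictionary=suggest&spellcheck.build=true&spellcheck.count=5&spellcheck.onlyMorePopular=true&spellcheck.collate=true
  }}}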
  
  = Tips and tricks =
- * Use TSTLookup unless you need a more sophisticated matching from JaspellLookup. See [[https://issues.apache.org/jira/browse/SOLR-1316?focusedCommentId=12873599&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12873599|benchmark results]] - the source of this benchmark is in SuggesterTest.
+ * Use FSTLookup to conserve memory (unless you need more sophisticated matching - then look at JaspellLookup). There are benchmarks of all three implementations: [[https://issues.apache.org/jira/browse/SOLR-1316?focusedCommentId=12873599&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12873599|SOLR-1316]] (outdated) and a somewhat newer set in [[https://issues.apache.org/jira/browse/SOLR-2378|SOLR-2378]]. The class that runs these benchmarks is in the source tree and is called LookupBenchmarkTest.
  
  * Use the `threshold` parameter to limit the size of the trie, to reduce the build time and to remove invalid/uncommon terms. Values below 0.01 should be sufficient; greater values restrict the dictionary to terms that occur in a larger portion of documents. Values above 0.5 probably don't make much sense.