Posted to dev@lucene.apache.org by "Varun Thacker (JIRA)" <ji...@apache.org> on 2014/08/20 09:36:27 UTC
[jira] [Commented] (SOLR-5683) Documentation of Suggester V2

    [ https://issues.apache.org/jira/browse/SOLR-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103562#comment-14103562 ] 

Varun Thacker commented on SOLR-5683:
-------------------------------------

First draft at documenting the suggesters. This covers all the suggester documentation listed under "Issues from CHANGES.txt that were never doc'ed as part of their release:" on the https://cwiki.apache.org/confluence/display/solr/Internal+-+TODO+List page.

bq. DocumentDictionaryFactory – user can specify suggestion field along with optional weight and payload fields from their search index.

Looking at the code of DocumentDictionaryFactory, the weight field is not optional.
--------------------------------------------------------------------------------------------------------------------------------------------------------
field - The field from which the suggester's dictionary will be populated.
weightField - The field from which each suggestion's weight will be populated. This should be a numeric field. Suggestions are sorted by this value, as it is the sole criterion for relevance.
payloadField - Accompanying payload for each suggestion that gets built.
suggestAnalyzerFieldType - Specify the analyzer to be used for the suggester. The "index" analyzer of this fieldType will be used to build the suggest dictionary and the "query" analyzer will be used during querying.

Config (index time) options:
name - Name of suggester. This is optional if you have only one suggester defined.
sourceLocation - External file location for file-based suggesters only.
lookupImpl - Type of lookup to use. The default is JaspellLookupFactory. The list below covers the available lookup implementations.
dictionaryImpl - The type of dictionary to use when building the suggester. The default is FileDictionaryFactory for a file-based suggester (one with sourceLocation set) and HighFrequencyDictionaryFactory otherwise.
storeDir - Location to store the dictionary on disk.
buildOnCommit - Set to true to build the suggester automatically after every commit. Useful if you want to keep the suggester in sync with your latest data.
buildOnOptimize - Set to true to build the suggester automatically after every optimize. Useful if you want to keep the suggester in sync with your latest data.

Query time options:
suggest.dictionary - name of suggester to use
suggest.count - number of suggestions to return
suggest.q - query to use for lookup
suggest.build - command to build the suggester
suggest.reload - command to reload the suggester
suggest.buildAll - command to build all suggesters in the component
suggest.reloadAll - command to reload all suggesters in the component
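
As a rough illustration of how the query time options combine into one request, here is a small Python sketch. The host, port and /suggest handler path are copied from the example URLs in the issue description; the helper name build_suggest_url and the suggester name are made up for this example and are not part of Solr.

```python
from urllib.parse import urlencode


def build_suggest_url(base_url, query, dictionaries, count=None):
    """Build a suggest request URL (hypothetical helper, not a Solr API)."""
    params = [("suggest", "true")]
    # suggest.dictionary may be repeated, once per suggester to query.
    params += [("suggest.dictionary", name) for name in dictionaries]
    params.append(("suggest.q", query))
    if count is not None:
        params.append(("suggest.count", str(count)))
    return base_url + "?" + urlencode(params)


url = build_suggest_url(
    "http://localhost:8983/solr/suggest", "elec", ["suggester1"], count=5
)
print(url)
```

Passing a list of (name, value) pairs rather than a dict keeps repeated parameters possible and preserves their order.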

--------------------------------------------------------------------------------------------------------------------------------------------------------

Lookup Implementation Options - 
- AnalyzingLookupFactory: Suggester that first analyzes the incoming text and adds the analyzed form to a weighted FST, and then does the same thing at lookup time.
	suggestAnalyzerFieldType - The field type whose analyzers are used at "query-time" and "build-time" to analyze suggestions.
	exactMatchFirst - If true, exact matches are returned first, even if they are prefixes of other strings in the FST that have larger weights. Default is true.
	preserveSep - If true, a separator between tokens is preserved. This means that suggestions are sensitive to tokenization (e.g. baseball is different from base ball). Default is true.
	preservePositionIncrements - Whether the suggester should preserve position increments. If true, the gaps left by token filters (for example when StopFilter removes a stopword) are respected when building the suggester. The default is false.
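
To make the effect of preserveSep concrete, here is a toy Python sketch. It is not how the suggester is implemented: the real one runs the configured analyzer and stores the result in an FST, while this only lowercases, splits on whitespace and compares prefixes. The separator character and both function names are assumptions made for the illustration.

```python
SEP = "\u001f"  # separator character chosen arbitrarily for this sketch


def analyzed_form(text, preserve_sep=True):
    """Toy analysis: lowercase, split on whitespace, join the tokens."""
    tokens = text.lower().split()
    return (SEP if preserve_sep else "").join(tokens)


def matches(query, suggestion, preserve_sep=True):
    """True if the analyzed query is a prefix of the analyzed suggestion."""
    return analyzed_form(suggestion, preserve_sep).startswith(
        analyzed_form(query, preserve_sep)
    )


print(matches("base ball", "Baseball bat", preserve_sep=True))   # False
print(matches("base ball", "Baseball bat", preserve_sep=False))  # True
```

With the separator kept, "base ball" and "baseball" analyze to different strings, so the query no longer matches.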

- FuzzyLookupFactory: An extension of the AnalyzingSuggester that also matches fuzzily. Similarity is measured by the Levenshtein algorithm.
	exactMatchFirst - If true, exact matches are returned first, even if they are prefixes of other strings in the FST that have larger weights. Default is true.
	preserveSep - If true, a separator between tokens is preserved. This means that suggestions are sensitive to tokenization (e.g. baseball is different from base ball). Default is true.
	maxSurfaceFormsPerAnalyzedForm - Maximum number of surface forms to keep for a single analyzed form. When there are too many surface forms, the lowest weighted ones are discarded.
	maxGraphExpansions - When building the FST ("index-time"), each path through the tokenstream graph is added as an individual entry. This places an upper bound on how many expansions will be added for a single suggestion. The default is -1, which means there is no limit.
	preservePositionIncrements - Whether the suggester should preserve position increments. If true, the gaps left by token filters (for example when StopFilter removes a stopword) are respected when building the suggester. The default is false.
	maxEdits - Maximum number of string edits allowed. The system's hard limit is 2. The default is 1.
	transpositions - If true, transpositions are treated as a primitive edit operation. The default is true.
	nonFuzzyPrefix - Length of the leading part of the query that must match a suggestion exactly, with no edits. The default is 1.
	minFuzzyLength - Minimum query length before any string edits are allowed. The default is 3.
	unicodeAware - If true, the maxEdits, minFuzzyLength, transpositions and nonFuzzyPrefix parameters are measured in Unicode code points (actual letters) instead of bytes. The default is false.
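
The interaction of maxEdits, transpositions, nonFuzzyPrefix and minFuzzyLength can be sketched in plain Python. This is only an approximation for intuition: it compares two whole strings pairwise, whereas the real suggester works against the FST. The function names are hypothetical, and the defaults simply mirror the values listed above.

```python
def edit_distance(a, b, transpositions=True):
    """Levenshtein distance; optionally counts an adjacent swap as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_accepts(query, candidate, max_edits=1, non_fuzzy_prefix=1,
                  min_fuzzy_length=3, transpositions=True):
    """Toy check of a query against a candidate string of similar length."""
    if query[:non_fuzzy_prefix] != candidate[:non_fuzzy_prefix]:
        return False  # the leading characters must match exactly
    if len(query) < min_fuzzy_length:
        return query == candidate  # query too short: no edits allowed
    return edit_distance(query, candidate, transpositions) <= max_edits


print(fuzzy_accepts("teh", "the"))                        # True
print(fuzzy_accepts("teh", "the", transpositions=False))  # False
```

Swapping "eh" to "he" is one edit when transpositions are primitive and two substitutions otherwise, which is why the second call exceeds maxEdits=1.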

- AnalyzingInfixLookupFactory: Analyzes the input text and then suggests matches based on prefix matches to any tokens in the indexed text. This uses a Lucene index for its dictionary.
	indexPath - When using AnalyzingInfixSuggester you can provide your own path where the index will get built. The default is analyzingInfixSuggesterIndexDir, which will be created in your collection's data directory.
	minPrefixChars - Minimum number of leading characters before PrefixQuery is used (default 4). Prefixes shorter than this are indexed as character ngrams (increasing index size but making lookups faster).

- BlendedInfixLookupFactory: An extension of the AnalyzingInfixSuggester that additionally weights prefix matches by their position within the matched document. You can choose how strongly hits closer to the start of the suggestion are favored.
	blenderType - Used to calculate the weight coefficient from the position of the first matching word.
		linear: weightFieldValue*(1 - 0.10*position) - Matches closer to the start are given a higher score (Default).
		reciprocal: weightFieldValue/(1+position) - By this formula the score also decreases as position grows, dropping more steeply than linear over the first few positions.
	numFactor - Factor by which the number of requested suggestions is multiplied to get the number of searched elements, from which the results are then pruned. Default is 10.
	indexPath - When using BlendedInfixSuggester you can provide your own path where the index will get built. The default directory name is blendedInfixSuggesterIndexDir, which will be created in your collection's data directory.
	minPrefixChars - Minimum number of leading characters before PrefixQuery is used (default 4). Prefixes shorter than this are indexed as character ngrams (increasing index size but making lookups faster).
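
The two blenderType formulas are easy to compare side by side. The Python sketch below just evaluates them as written above; the function name is made up, and treating position as a zero-based word offset is an assumption of this example.

```python
def blended_weight(weight, position, blender_type="linear"):
    """Evaluate the blenderType formulas listed above (illustrative only)."""
    if blender_type == "linear":
        return weight * (1 - 0.10 * position)
    if blender_type == "reciprocal":
        return weight / (1 + position)
    raise ValueError("unknown blender type: " + blender_type)


# Compare both blenders for a suggestion whose weight field value is 100.
for position in (0, 1, 3):
    linear = round(blended_weight(100, position, "linear"), 2)
    reciprocal = round(blended_weight(100, position, "reciprocal"), 2)
    print(position, linear, reciprocal)
```

Both formulas leave the weight untouched at position 0 and shrink it as the first matching word moves further from the start.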

- FreeTextLookupFactory: Looks at the last tokens plus the prefix of whatever final token the user is typing, if present, to predict the most likely next token. The number of previous tokens to consider can also be specified. This suggester would only be used as a fallback, when the primary suggester fails to find any suggestions.
	ngrams - The maximum number of tokens per shingle used to build the dictionary. The default value is 2. Increasing this means more than the previous 2 tokens are taken into consideration when making suggestions.
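
For readers unfamiliar with shingles, this small Python sketch shows what word n-grams up to a given size look like. It only illustrates the idea behind the ngrams setting; the function name and sample tokens are invented, and the real suggester builds its model through the analysis chain rather than like this.

```python
def shingles(tokens, max_ngrams=2):
    """Return every run of 1..max_ngrams consecutive tokens, joined by spaces."""
    out = []
    for n in range(1, max_ngrams + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


print(shingles(["new", "york", "city"], max_ngrams=2))
# ['new', 'york', 'city', 'new york', 'york city']
```

Raising max_ngrams to 3 would add 'new york city', i.e. one more token of context per entry.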

- FSTLookupFactory: An FST based suggester.
	exactMatchFirst - If true, exact matches are returned first, even if they are prefixes of other strings in the FST that have larger weights. Default is true.
	weightBuckets - The number of separate buckets for weights which the suggester will use while building its dictionary.

- TSTLookupFactory: A simple compact ternary trie based lookup.

- WFSTLookupFactory: Weighted automaton representation; an alternative to FSTLookup for more fine-grained ranking. WFSTLookup does not use buckets, but instead a shortest path algorithm. Note that it expects weights to be whole numbers.

- JaspellLookupFactory: A more complex lookup based on a ternary trie from the JaSpell (http://jaspell.sourceforge.net/) project.

--------------------------------------------------------------------------------------------------------------------------------------------------------

Dictionary pluggability - The option to choose the dictionary implementation that suggesters use to consume input from the search index.

DocumentDictionaryFactory – You need to specify the suggestion field ('field') along with the weight ('weightField') and payload ('payloadField') fields from your search index.
DocumentExpressionDictionaryFactory – Same as DocumentDictionaryFactory but allows users to specify an arbitrary expression in the 'weightExpression' tag.
	weightExpression - An arbitrary expression used for scoring the suggestions. The fields it references need to be numeric fields.
HighFrequencyDictionaryFactory – User can specify a suggestion field and a threshold to prune out less frequent terms.
	threshold - A value between zero and one representing the minimum fraction of the total documents in which a term should appear in order to be added to the lookup dictionary.
Input from external files
FileDictionaryFactory – User can specify a file which contains suggest entries, along with weights and payloads. One entry is allowed per line.
	fieldDelimiter - The delimiter separating the entries, weights and payloads. The default is tab.
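
To show the one-entry-per-line file layout, here is a hedged Python sketch of how such a line could be split. It is not the FileDictionaryFactory code. The helper name is hypothetical, treating weight and payload as optional follows the issue description, and falling back to a weight of 1 is an assumption of this example.

```python
def parse_dictionary_line(line, field_delimiter="\t"):
    """Split one dictionary line into (entry, weight, payload)."""
    fields = line.rstrip("\n").split(field_delimiter)
    entry = fields[0]
    # Assumption for this sketch: a missing weight falls back to 1.
    weight = int(fields[1]) if len(fields) > 1 else 1
    payload = fields[2] if len(fields) > 2 else None
    return entry, weight, payload


print(parse_dictionary_line("accidentally\t10\tsome payload\n"))
# ('accidentally', 10, 'some payload')
print(parse_dictionary_line("accident"))
# ('accident', 1, None)
```

A different fieldDelimiter would be passed as the second argument, for example parse_dictionary_line("a|5", "|").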
--------------------------------------------------------------------------------------------------------------------------------------------------------
Using Multiple Suggesters -
You can ask multiple suggesters to provide suggestions for the same query by repeating the suggest.dictionary parameter.
Example Syntax - localhost:8983/solr/suggest?suggest=true&suggest.dictionary=suggest1&suggest.dictionary=suggest2&suggest.q=python
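
On the client side, a response from several suggesters can be flattened into one list. The Python sketch below assumes the response structure shown in the issue description (suggest, then suggester name, then query, then suggestions); the sample data and the helper name are made up for illustration.

```python
def collect_suggestions(response, query):
    """Flatten suggestions from all suggesters for one query string."""
    merged = []
    for suggester_name, by_query in response["suggest"].items():
        for item in by_query.get(query, {}).get("suggestions", []):
            merged.append((item["term"], item["weight"], suggester_name))
    # Highest weight first; the stable sort keeps suggester order for ties.
    merged.sort(key=lambda s: s[1], reverse=True)
    return merged


# Made-up response mirroring the structure from the issue description.
sample = {
    "suggest": {
        "suggest1": {"python": {"numFound": 1, "suggestions": [
            {"term": "python programming", "weight": 10, "payload": ""}]}},
        "suggest2": {"python": {"numFound": 1, "suggestions": [
            {"term": "python cookbook", "weight": 100, "payload": ""}]}},
    }
}
print(collect_suggestions(sample, "python"))
```

Sorting by weight across suggesters is a design choice of this sketch; weights produced by different dictionaries are not necessarily comparable.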

> Documentation of Suggester V2
> -----------------------------
>
>                 Key: SOLR-5683
>                 URL: https://issues.apache.org/jira/browse/SOLR-5683
>             Project: Solr
>          Issue Type: Task
>          Components: SearchComponents - other
>            Reporter: Areek Zillur
>            Assignee: Areek Zillur
>             Fix For: 4.9, 5.0
>
>
> Place holder for documentation that will eventually end up in the Solr Ref guide.
> ====
> The new Suggester Component allows Solr to fully utilize the Lucene suggesters. 
> The main features are:
> - lookup pluggability (TODO: add description):
>   -- AnalyzingInfixLookupFactory
>   -- AnalyzingLookupFactory
>   -- FuzzyLookupFactory
>   -- FreeTextLookupFactory
>   -- FSTLookupFactory
>   -- WFSTLookupFactory
>   -- TSTLookupFactory
>   --  JaspellLookupFactory
>    - Dictionary pluggability (give users the option to choose the dictionary implementation to use for their suggesters to consume)
>    -- Input from search index
>       --- DocumentDictionaryFactory – user can specify suggestion field along with optional weight and payload fields from their search index.
>       --- DocumentExpressionFactory – same as DocumentDictionaryFactory but allows users to specify arbitrary expression using existing numeric fields.
>      --- HighFrequencyDictionaryFactory – user can specify a suggestion field and specify a threshold to prune out less frequent terms.	
>    -- Input from external files
>      --- FileDictionaryFactory – user can specify a file which contains suggest entries, along with optional weights and payloads.
> Config (index time) options:
>   - name - name of suggester
>   - sourceLocation - external file location (for file-based suggesters)
>   - lookupImpl - type of lookup to use [default JaspellLookupFactory]
>   - dictionaryImpl - type of dictionary to use (lookup input) [default
>     (sourceLocation == null ? HighFrequencyDictionaryFactory : FileDictionaryFactory)]
>   - storeDir - location to store in-memory data structure in disk
>   - buildOnCommit - command to build suggester for every commit
>   - buildOnOptimize - command to build suggester for every optimize
> Query time options:
>   - suggest.dictionary - name of suggester to use (can occur multiple times for batching suggester requests)
>   - suggest.count - number of suggestions to return
>   - suggest.q - query to use for lookup
>   - suggest.build - command to build the suggester
>   - suggest.reload - command to reload the suggester
>   - buildAll – command to build all suggesters in the component
>   - reloadAll – command to reload all suggesters in the component
> Example query:
> {code}
> http://localhost:8983/solr/suggest?suggest.dictionary=suggester1&suggest=true&suggest.build=true&suggest.q=elec
> {code}
> Distributed query:
> {code}
> http://localhost:7574/solr/suggest?suggest.dictionary=suggester2&suggest=true&suggest.build=true&suggest.q=elec&shards=localhost:8983/solr,localhost:7574/solr&shards.qt=/suggest
> {code}	
> Response Format:
> The response format can be either XML or JSON. The typical response structure is as follows:
>  {code}
> {
>   suggest: {
>     suggester_name: {
>       suggest_query: { numFound: .., suggestions: [ {term: .., weight: .., payload: ..}, .. ] }
>     }
>   }
> }
> {code}
>   
> Example Response:
> {code}
> {
>     responseHeader: {
>         status: 0,
>         QTime: 3
>     },
>     suggest: {
>         suggester1: {
>             e: {
>                 numFound: 1,
>                 suggestions: [
>                     {
>                         term: "electronics and computer1",
>                         weight: 100,
>                         payload: ""
>                     }
>                 ]
>             }
>         },
>         suggester2: {
>             e: {
>                 numFound: 1,
>                 suggestions: [
>                     {
>                         term: "electronics and computer1",
>                         weight: 10,
>                         payload: ""
>                     }
>                 ]
>             }
>         }
>     }
> }
> {code}
> Example solrconfig snippet with multiple suggester configuration:
> {code}  
>   <searchComponent name="suggest" class="solr.SuggestComponent">
>     <lst name="suggester">
>       <str name="name">suggester1</str>
>       <str name="lookupImpl">FuzzyLookupFactory</str>      
>       <str name="dictionaryImpl">DocumentDictionaryFactory</str>      
>       <str name="field">cat</str>
>       <str name="weightField">price</str>
>       <str name="suggestAnalyzerFieldType">string</str>
>     </lst>
>     <lst name="suggester">
>       <str name="name">suggester2</str>
>       <str name="dictionaryImpl">DocumentExpressionDictionaryFactory</str>
>       <str name="lookupImpl">FuzzyLookupFactory</str>
>       <str name="field">product_name</str>
>       <str name="weightExpression">((price * 2) + ln(popularity))</str>
>       <str name="sortField">weight</str>
>       <str name="sortField">price</str>
>       <str name="storeDir">suggest_fuzzy_doc_expr_dict</str>
>       <str name="suggestAnalyzerFieldType">text</str>
>     </lst>
> </searchComponent>
> {code}


