Posted to issues@solr.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2022/09/27 21:44:00 UTC

[jira] [Updated] (SOLR-16436) DirectSolrSpellChecker: maxQueryFrequency bug in multi-shard

     [ https://issues.apache.org/jira/browse/SOLR-16436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris M. Hostetter updated SOLR-16436:
--------------------------------------
    Attachment: SOLR-16436.patch
        Status: Open  (was: Open)

The "false positive" situation seems fairly straightforward to fix: give the {{DirectSpellChecker}} instance on the coordinator node the ability to modify the shard requests and to participate in the "merge" of shard responses. It can then request extended results, sum the {{origFreq}} of each original term across shards, and ignore suggestions from the individual shards when the term is in fact frequent enough (in the collection as a whole) that it should not generate suggestions.

I'm attaching a patch w/test that demonstrates this improvement.
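
As a rough illustration of the merge-time check described above (this is not the code in the attached patch), the following Java sketch sums per-shard {{origFreq}} values and compares the total against an absolute {{maxQueryFrequency}}. The class name, method names, and the map-based representation of shard responses are hypothetical, and the strict "greater than" comparison is an assumption about how the threshold is applied.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Solr or of SOLR-16436.patch.
final class OrigFreqMerge {

    /** Sums the origFreq reported by each shard, keyed by original query term. */
    static Map<String, Long> sumOrigFreq(List<Map<String, Integer>> shardOrigFreqs) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Integer> shard : shardOrigFreqs) {
            for (Map.Entry<String, Integer> e : shard.entrySet()) {
                totals.merge(e.getKey(), e.getValue().longValue(), Long::sum);
            }
        }
        return totals;
    }

    /**
     * Returns true when the collection-wide frequency exceeds the absolute
     * threshold, meaning merged suggestions for the term should be dropped.
     * (Assumption: the threshold is exclusive.)
     */
    static boolean shouldSuppress(long totalOrigFreq, int maxQueryFrequency) {
        return totalOrigFreq > maxQueryFrequency;
    }

    public static void main(String[] args) {
        // Two shards each report the term in 6 docs; the absolute threshold is 10.
        Map<String, Long> totals =
            sumOrigFreq(List.of(Map.of("lucene", 6), Map.of("lucene", 6)));
        System.out.println(shouldSuppress(totals.get("lucene"), 10));
    }
}
```

In this example neither shard alone exceeds the threshold of 10, but the summed frequency of 12 does, so the coordinator would discard the shard suggestions for that term.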
----
I don't really have any good ideas for addressing the "false negative" situation ... I'm hoping someone else reading this will have a "Eureka!" moment and suggest a straightforward solution that I'm not considering. Worst case (if no one chimes in with an idea soon), I'll update the patch to modify the ref-guide to note that when {{maxQueryFrequency}} is configured as a percentage, it's evaluated {_}per shard{_}.

> DirectSolrSpellChecker: maxQueryFrequency bug in multi-shard 
> -------------------------------------------------------------
>
>                 Key: SOLR-16436
>                 URL: https://issues.apache.org/jira/browse/SOLR-16436
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: spellchecker
>            Reporter: Chris M. Hostetter
>            Assignee: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-16436.patch
>
>
> {{DirectSolrSpellChecker}} has some very confusing/unexpected behavior when:
>  * {{maxQueryFrequency}} is configured
>  * In a multi-shard collection
>  * Using {{thresholdTokenFrequency}} or {{spellcheck.onlyMorePopular=true}} or {{spellcheck.alternativeTermCount}}
>  ** (ie: anything that causes {{SuggestMode != SUGGEST_WHEN_NOT_IN_INDEX}} so suggestions are possible even for terms in the index)
> The nature of the unexpected behavior varies depending on whether {{maxQueryFrequency}} is configured as a float less than 1 (ie: a percentage relative to the maxDocs in the index) or an integer greater than 1 (ie: an absolute max frequency):
>  * When {{maxQueryFrequency < 1}} (ie: "percentage of maxDocs")
>  ** It's possible to get "false negative" suggestions
>  *** ie: a term that _should_ generate suggestions (and would in an equivalent single-shard deployment) *does not*
>  ** A term from the original query may exist in fewer total documents than the configured {{maxQueryFrequency}} percentage across the entire collection, but will not return suggestions
>  ** This can happen if a term exists in more than the configured {{maxQueryFrequency}} percentage of docs on _one (or more)_ individual shards
>  *** As long as at least one shard says the term is "correctly spelled" (which is what {{DirectSolrSpellChecker}} decides when the {{maxQueryFrequency}} threshold is met) then the merge logic ignores any suggestions that might come from other shards
>  * When {{1 < maxQueryFrequency}} (ie "absolute value")
>  ** It's possible to get "false positive" suggestions
>  *** ie: a term that _should not_ generate suggestions (and would not in an equivalent single-shard deployment) *does*
>  ** A term from the original query may exist in more total documents in the collection than the configured {{maxQueryFrequency}} but will still return suggestions
>  ** This can happen if a term exists in fewer than the configured {{maxQueryFrequency}} number of docs on _every_ individual shard
>  *** Since no shard says the term is "correctly spelled", the suggestions are merged and returned
>  *** No aspect of the code considers the possibility that the sum of the {{origFreq}} returned by all shards might be higher than the specified {{maxQueryFrequency}}
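
The two failure modes listed above can be reproduced with a small arithmetic sketch. The Java below is only an illustration under stated assumptions: the class and method names are hypothetical, and the threshold check is a simplified model (a value of 1 or more is treated as an absolute count, a smaller value as a fraction of maxDoc), not the actual Lucene/Solr implementation.

```java
// Hypothetical demo class; a simplified model of the threshold logic.
final class MaxQueryFrequencyDemo {

    /** Assumed check: is the term frequent enough to count as "correctly spelled"? */
    static boolean isTooFrequent(long docFreq, long maxDoc, double maxQueryFrequency) {
        if (maxQueryFrequency >= 1.0) {
            return docFreq > maxQueryFrequency;                 // absolute count
        }
        return docFreq > Math.ceil(maxQueryFrequency * maxDoc); // fraction of maxDoc
    }

    /** Multi-shard behavior: one shard saying "correctly spelled" blocks all suggestions. */
    static boolean anyShardTooFrequent(long[] docFreqs, long[] maxDocs, double mqf) {
        for (int i = 0; i < docFreqs.length; i++) {
            if (isTooFrequent(docFreqs[i], maxDocs[i], mqf)) {
                return true;
            }
        }
        return false;
    }

    /** Equivalent single-shard behavior: evaluate the summed counts once. */
    static boolean collectionTooFrequent(long[] docFreqs, long[] maxDocs, double mqf) {
        long docFreq = 0;
        long maxDoc = 0;
        for (int i = 0; i < docFreqs.length; i++) {
            docFreq += docFreqs[i];
            maxDoc += maxDocs[i];
        }
        return isTooFrequent(docFreq, maxDoc, mqf);
    }

    public static void main(String[] args) {
        // "False negative": 10% threshold; the term is in 30 of 100 docs on
        // shard A and 0 of 900 docs on shard B (3% of the whole collection).
        long[] df1 = {30, 0};
        long[] md1 = {100, 900};
        System.out.println(anyShardTooFrequent(df1, md1, 0.10));   // per-shard view
        System.out.println(collectionTooFrequent(df1, md1, 0.10)); // collection view

        // "False positive": absolute threshold of 10; 6 docs on each of two shards.
        long[] df2 = {6, 6};
        long[] md2 = {100, 100};
        System.out.println(anyShardTooFrequent(df2, md2, 10));
        System.out.println(collectionTooFrequent(df2, md2, 10));
    }
}
```

In the first scenario the per-shard view treats the term as correctly spelled (30 exceeds 10% of 100) while the collection-wide view does not (30 is below 10% of 1000), so suggestions are wrongly withheld. In the second scenario no shard exceeds 10 on its own, but the collection-wide total of 12 does, so suggestions are wrongly returned.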


