Posted to issues@solr.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2022/09/27 21:44:00 UTC
[jira] [Updated] (SOLR-16436) DirectSolrSpellChecker: maxQueryFrequency bug in multi-shard
[ https://issues.apache.org/jira/browse/SOLR-16436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris M. Hostetter updated SOLR-16436:
--------------------------------------
Attachment: SOLR-16436.patch
Status: Open (was: Open)
The "false positive" situation seems fairly straightforward to fix: give the {{DirectSpellChecker}} instance on the coordinator node the ability to modify the Shard Requests and to participate in the "merge" of Shard Responses. It can then request extended results, sum up the {{origFreq}} of each original term, and decide when/if to ignore suggestions from the individual shards because the term is in fact frequent enough (in the collection as a whole) that it should not generate suggestions.
I'm attaching a patch w/test that demonstrates this improvement.
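To make the idea concrete, here is a minimal sketch of the coordinator-side decision for the absolute-value case. It is not the code from the attached patch: the class and method names are hypothetical, and the strict "greater than" comparison is an assumption about how the threshold is applied.

```java
import java.util.List;

/** Hypothetical sketch; not taken from SOLR-16436.patch. */
public class OrigFreqMergeSketch {

    /** Sums the origFreq values that each shard reported for one original query term. */
    public static long totalOrigFreq(List<Integer> perShardOrigFreq) {
        long total = 0;
        for (int freq : perShardOrigFreq) {
            total += freq;
        }
        return total;
    }

    /**
     * Returns true when the coordinator should ignore the shards' suggestions
     * because the term is frequent enough in the collection as a whole.
     * Assumes maxQueryFrequency is an absolute document count and that the
     * check is a strict "greater than".
     */
    public static boolean suppressSuggestions(List<Integer> perShardOrigFreq,
                                              int maxQueryFrequency) {
        return totalOrigFreq(perShardOrigFreq) > maxQueryFrequency;
    }

    public static void main(String[] args) {
        // No single shard exceeds 5, but the collection-wide total (9) does.
        System.out.println(suppressSuggestions(List.of(3, 4, 2), 5)); // true
        // The collection-wide total (3) stays under the limit.
        System.out.println(suppressSuggestions(List.of(1, 1, 1), 5)); // false
    }
}
```

The point of the sketch is only that the decision moves from each shard to the merge step, where the summed {{origFreq}} is available.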
----
I don't really have any good ideas for addressing the "false negative" situation ... I'm hoping someone else reading this will have a "Eureka! moment" and suggest a straightforward solution that I'm not considering. Worst case (if no one chimes in with an idea soon) I'll update the patch to modify the ref-guide to note that when configuring {{maxQueryFrequency}} as a percentage, it's evaluated {_}per shard{_}.
> DirectSolrSpellChecker: maxQueryFrequency bug in multi-shard
> -------------------------------------------------------------
>
> Key: SOLR-16436
> URL: https://issues.apache.org/jira/browse/SOLR-16436
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: spellchecker
> Reporter: Chris M. Hostetter
> Assignee: Chris M. Hostetter
> Priority: Major
> Attachments: SOLR-16436.patch
>
>
> {{DirectSolrSpellChecker}} has some very confusing/unexpected behavior when:
> * {{maxQueryFrequency}} is configured
> * In a multi-shard collection
> * Using {{thresholdTokenFrequency}} or {{spellcheck.onlyMorePopular=true}} or {{spellcheck.alternativeTermCount}}
> ** (ie: anything that causes {{SuggestMode != SUGGEST_WHEN_NOT_IN_INDEX}} so suggestions are possible even for terms in the index)
> The nature of the unexpected behavior varies depending on whether {{maxQueryFrequency}} is configured as a float less than 1 (ie: a percentage relative to the maxDocs in the index) or an integer greater than 1 (ie: an absolute max frequency):
> * When {{maxQueryFrequency < 1}} (ie: "percentage of maxDocs")
> ** It's possible to get "false negative" suggestions
> *** ie: a term that _should_ generate suggestions (and would in an equivalent single-shard deployment) *does not*
> ** A term from the original query may exist in fewer total documents across the entire collection than the configured {{maxQueryFrequency}} percentage, but still not return suggestions
> ** This can happen if a term exists in more than the configured {{maxQueryFrequency}} percentage of docs on _one (or more)_ individual shards
> *** As long as at least one shard says the term is "correctly spelled" (which is what {{DirectSolrSpellChecker}} decides when the {{maxQueryFrequency}} threshold is met) then the merge logic ignores any suggestions that might come from other shards
> * When {{1 < maxQueryFrequency}} (ie "absolute value")
> ** It's possible to get "false positive" suggestions
> *** ie: a term that _should not_ generate suggestions (and would not in an equivalent single-shard deployment) *does*
> ** A term from the original query may exist in more total documents in the collection than the configured {{maxQueryFrequency}} but will still return suggestions
> ** This can happen if a term exists in fewer than the configured {{maxQueryFrequency}} number of docs on _every_ individual shard
> *** Since no shard says the term is "correctly spelled", the suggestions are merged and returned
> *** No aspect of the code considers the possibility that the sum of the {{origFreq}} returned by all shards might be higher than the specified {{maxQueryFrequency}}
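
The two failure modes in the quoted description can be reproduced with a small numeric model. This is an illustrative sketch only: the class, the method names, and the shard sizes are made up, and the threshold check is a simplified assumption about how a per-shard spellchecker interprets {{maxQueryFrequency}} (values below 1 as a fraction of maxDoc, values of 1 or more as an absolute document count).

```java
/** Hypothetical model of per-shard vs. collection-wide threshold checks. */
public class ShardThresholdDemo {

    /** Simplified, assumed form of the threshold check for one index. */
    public static boolean exceedsThreshold(int docFreq, int maxDoc, float maxQueryFrequency) {
        if (maxQueryFrequency >= 1f) {
            return docFreq > maxQueryFrequency; // absolute document count
        }
        return docFreq > (int) Math.ceil(maxQueryFrequency * (float) maxDoc); // fraction of maxDoc
    }

    /** True if at least one shard, checked on its own, calls the term "correctly spelled". */
    public static boolean anyShardExceeds(int[] docFreqs, int[] maxDocs, float maxQueryFrequency) {
        for (int i = 0; i < docFreqs.length; i++) {
            if (exceedsThreshold(docFreqs[i], maxDocs[i], maxQueryFrequency)) {
                return true;
            }
        }
        return false;
    }

    /** The same check applied to the collection as a whole (summed counts). */
    public static boolean collectionExceeds(int[] docFreqs, int[] maxDocs, float maxQueryFrequency) {
        int totalFreq = 0;
        int totalDocs = 0;
        for (int i = 0; i < docFreqs.length; i++) {
            totalFreq += docFreqs[i];
            totalDocs += maxDocs[i];
        }
        return exceedsThreshold(totalFreq, totalDocs, maxQueryFrequency);
    }

    public static void main(String[] args) {
        // "False negative": the term is in 60 of 100 docs on one shard and 0 of 900 on the other.
        // Collection-wide that is 6%, well under 50%, yet one shard alone exceeds the threshold.
        int[] freqsA = {60, 0};
        int[] docsA = {100, 900};
        System.out.println(anyShardExceeds(freqsA, docsA, 0.5f));   // true
        System.out.println(collectionExceeds(freqsA, docsA, 0.5f)); // false

        // "False positive": 8 docs on each shard, absolute limit 10.
        // No shard exceeds the limit, but the collection-wide total of 16 does.
        int[] freqsB = {8, 8};
        int[] docsB = {100, 100};
        System.out.println(anyShardExceeds(freqsB, docsB, 10f));    // false
        System.out.println(collectionExceeds(freqsB, docsB, 10f));  // true
    }
}
```

In both cases the per-shard answer disagrees with what an equivalent single-shard deployment would decide, which is the mismatch the issue describes.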