You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2023/01/11 23:10:17 UTC

[GitHub] [lucene] hossman opened a new issue, #12077: WordBreakSpellChecker.maxEvaluations usage in generateBreakUpSuggestions() makes no sense

hossman opened a new issue, #12077:
URL: https://github.com/apache/lucene/issues/12077

   ### Description
   
   `WordBreakSpellChecker` has a `maxEvaluations` config option (default: 1000) which is suppose to be the "maximum number of word combinations to evaluate" but the way this setting is used in `generateBreakUpSuggestions()` makes no sense.
   
   The crux of this method is to iteratively loop over each character in the input to consider if splitting at that point in the string produces valid suggestion words on the left & right side of the split.  If the left side is a valid suggestion, then regardless of whether the right side was, recursively process the right side.
   
   As it's looping, it compares `maxEvaluations` to a `totalEvaluations` counter which is incremented on each iteration, as well as being passed down in each recursive call, breaking out of the loop if `totalEvaluations >= maxEvaluations`.
   
   But it *also* maintains a `thisTimeEvaluations` counter, which is only incremented based on the loop iterations, and this is what the method returns -- meaning that even if a nested recursive call exceeds `maxEvaluations`, the return value it gives back to it's caller is a much smaller number (bounded by the length of the string), and the caller will proceed to do more iterations (and more recursive calls)
   
   It's such a weird implementation choice (to explicitly maintain these two variables) that I'm sure my explanation above is hard to make sense of (it seems like it must have been intentional -- but i can't understand why?) and I confused myself 3 times trying to write it.
   
   The only way i found to really wrap my head around how "wrong" this is was by adding some sysout messages to the methods showing just how much iterating & recursing it does even when `maxEvaluations`.
   
   As a result of this behavior, it's actually possible for `WordBreakSpellChecker.suggestWordBreaks(...)` to return significantly more suggestions then the setting of `maxEvaluations`
   
   ---
   
   I'm attaching:
   * [WordBreakSpellChecker.nocommit.sysout.patch.txt](https://github.com/apache/lucene/files/10396330/WordBreakSpellChecker.nocommit.sysout.patch.txt) with the sysout logging i mentioned tp help visualize the recursion, and a trivial test case showing that even with `maxEvaluations=100` calling `suggestWordBreaks(...)` (on a 20 character input string) can generate 500+ suggested parsings.
   * [OUTPUT-org.apache.lucene.search.spell.TestWordBreakSpellChecker.txt](https://github.com/apache/lucene/files/10396366/OUTPUT-org.apache.lucene.search.spell.TestWordBreakSpellChecker.txt) with the output of the nocommit sysout println's from the above mentioned test -- which shows `generateSuggestWord()` (the method that looks up the `docFreq` to see if a "word" is valid) was called 35013 times!
   * [WordBreakSpellChecker.fix.patch.txt](https://github.com/apache/lucene/files/10396636/WordBreakSpellChecker.fix.patch.txt) with my suggested fix (and the same test which now passes)
   
   
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] hossman commented on issue #12077: WordBreakSpellChecker.maxEvaluations usage in generateBreakUpSuggestions() makes no sense

Posted by GitBox <gi...@apache.org>.
hossman commented on issue #12077:
URL: https://github.com/apache/lucene/issues/12077#issuecomment-1379633112

   FWIW: It also seems strange to me that this method is essentially doing a "depth first" walk of the possible splits, given that it's working a character at a time and the only possible `BreakSuggestionSortMethod` values start with `NUM_CHANGES_THEN_...`.
   
   it seems like we could get "better" results, with lower values of `maxEvaluations`, and less recursion (even if `maxChanges` is very large) if the logic was something like:
   
   * init a BitSet the same length as our input
   * loop over each character postion (`i`)
     * break if `totalEvaluations >= maxEvaluations` otherwise increment `totalEvaluations`
     * if `leftWord` is "valid" suggestion, record `i` in our bitset
     * if `rightWord` is also a "valid" suggestion, offer this left+right combo to our `suggestions` queue
   * if `numberBreaks` has not yet exceeded `maxChanges`:
     * loop over each set bit (`i`) in our BitSet:
       * break if `totalEvaluations >= maxEvaluations` otherwise increment `totalEvaluations`
       * recursively parse the portion of our input to the "right" of `i` (in the context of a new prefix using the portion to the "left" of `i` and an incremented `numberBreaks`)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] hossman commented on issue #12077: WordBreakSpellChecker.maxEvaluations usage in generateBreakUpSuggestions() makes no sense

Posted by "hossman (via GitHub)" <gi...@apache.org>.
hossman commented on issue #12077:
URL: https://github.com/apache/lucene/issues/12077#issuecomment-1399633507

   I spun the "breadth first" improvement idea off into #12100


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] hossman closed issue #12077: WordBreakSpellChecker.maxEvaluations usage in generateBreakUpSuggestions() makes no sense

Posted by "hossman (via GitHub)" <gi...@apache.org>.
hossman closed issue #12077: WordBreakSpellChecker.maxEvaluations usage in generateBreakUpSuggestions() makes no sense
URL: https://github.com/apache/lucene/issues/12077


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org