
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/02/09 08:18:42 UTC

[GitHub] [lucene-solr] donnerpeter opened a new pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…

donnerpeter opened a new pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330


   …o the misspelled word
   
   
   
   # Description
   
   A follow-up to the "ngram" suggestion support. It adds single prefixes and suffixes to dictionary entries to get better suggestions.
   
   # Solution
   
   Copy Hunspell's logic and extract some common code for FST traversal.
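
As a rough illustration of what "adding single prefixes and suffixes to dictionary entries" can mean, here is a minimal, self-contained sketch. It is not Hunspell's or Lucene's actual logic: the `Affix` class, its fields, and the `expand` method are hypothetical names chosen for this example, and real affix rules also carry conditions, flags, and cross-product handling that are left out here.

```java
import java.util.ArrayList;
import java.util.List;

class AffixExpansionSketch {
  /** Hypothetical, simplified affix rule: strip some characters, then append others. */
  static final class Affix {
    final boolean prefix; // true for a prefix rule, false for a suffix rule
    final String strip;   // characters removed from the root before appending
    final String append;  // characters added to the root

    Affix(boolean prefix, String strip, String append) {
      this.prefix = prefix;
      this.strip = strip;
      this.append = append;
    }
  }

  /** Returns the root plus every form obtained by applying exactly one applicable affix. */
  static List<String> expand(String root, List<Affix> affixes) {
    List<String> result = new ArrayList<>();
    result.add(root);
    for (Affix a : affixes) {
      if (a.prefix) {
        if (root.startsWith(a.strip)) {
          result.add(a.append + root.substring(a.strip.length()));
        }
      } else if (root.endsWith(a.strip)) {
        result.add(root.substring(0, root.length() - a.strip.length()) + a.append);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Affix> affixes = List.of(
        new Affix(false, "", "s"),     // plain suffix
        new Affix(false, "y", "ies"),  // applies only to roots ending in "y"
        new Affix(true, "", "re"));    // plain prefix
    System.out.println(expand("walk", affixes)); // prints [walk, walks, rewalk]
  }
}
```

Each expanded form can then be scored against the misspelled word in the same way as the bare root.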
   
   # Tests
   
   `allcaps.sug` from the Hunspell repo
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572760859



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
##########
@@ -33,44 +40,59 @@
  */
 class GeneratingSuggester {
   private static final int MAX_ROOTS = 100;
-  private static final int MAX_GUESSES = 100;
+  private static final int MAX_WORDS = 100;
+  private static final int MAX_GUESSES = 200;
   private final Dictionary dictionary;
+  private final SpellChecker speller;
 
-  GeneratingSuggester(Dictionary dictionary) {
-    this.dictionary = dictionary;
+  GeneratingSuggester(SpellChecker speller) {
+    this.dictionary = speller.dictionary;
+    this.speller = speller;
   }
 
   List<String> suggest(String word, WordCase originalCase, Set<String> prevSuggestions) {
-    List<WeightedWord> roots = findSimilarDictionaryEntries(word, originalCase);
-    List<WeightedWord> expanded = expandRoots(word, roots);
-    TreeSet<WeightedWord> bySimilarity = rankBySimilarity(word, expanded);
+    List<Weighted<DictEntry>> roots = findSimilarDictionaryEntries(word, originalCase);
+    List<Weighted<String>> expanded = expandRoots(word, roots);
+    TreeSet<Weighted<String>> bySimilarity = rankBySimilarity(word, expanded);
     return getMostRelevantSuggestions(bySimilarity, prevSuggestions);
   }
 
-  private List<WeightedWord> findSimilarDictionaryEntries(String word, WordCase originalCase) {
-    try {
-      IntsRefFSTEnum<IntsRef> fstEnum = new IntsRefFSTEnum<>(dictionary.words);
-      TreeSet<WeightedWord> roots = new TreeSet<>();
+  private List<Weighted<DictEntry>> findSimilarDictionaryEntries(
+      String word, WordCase originalCase) {
+    TreeSet<Weighted<DictEntry>> roots = new TreeSet<>();
+    processFST(
+        dictionary.words,
+        (key, forms) -> {
+          if (Math.abs(key.length - word.length()) > 4) return;
+
+          String root = toString(key);
+          List<DictEntry> entries = filterSuitableEntries(root, forms);
+          if (entries.isEmpty()) return;
+
+          if (originalCase == WordCase.LOWER
+              && WordCase.caseOf(root) == WordCase.TITLE
+              && !dictionary.hasLanguage("de")) {
+            return;
+          }
 
-      IntsRefFSTEnum.InputOutput<IntsRef> mapping;
-      while ((mapping = fstEnum.next()) != null) {
-        IntsRef key = mapping.input;
-        if (Math.abs(key.length - word.length()) > 4 || !isSuitableRoot(mapping.output)) continue;
-
-        String root = toString(key);
-        if (originalCase == WordCase.LOWER
-            && WordCase.caseOf(root) == WordCase.TITLE
-            && !dictionary.hasLanguage("de")) {
-          continue;
-        }
+          String lower = dictionary.toLowerCase(root);
+          int sc =
+              ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE))
+                  + commonPrefix(word, root);
 
-        String lower = dictionary.toLowerCase(root);
-        int sc =
-            ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE)) + commonPrefix(word, root);
+          entries.forEach(e -> roots.add(new Weighted<>(e, sc)));
+        });
+    return roots.stream().limit(MAX_ROOTS).collect(Collectors.toList());
+  }
 
-        roots.add(new WeightedWord(root, sc));
+  private void processFST(FST<IntsRef> fst, BiConsumer<IntsRef, IntsRef> keyValueConsumer) {

Review comment:
       I wonder if it makes sense to add something breakable in the middle, e.g. accepting some processor (unfortunately neither `BiFunction` nor `BiPredicate` conveys that semantics for me :( ). OTOH I don't need it right now, and breakability can be added later. Or, it could be made a `Stream` or `Iterable`.
   
   One complication though: here I ignore all `IOException`s, but that's probably not a good idea in a general FST case.





[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572676998



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
##########
@@ -84,33 +106,34 @@ private static String toString(IntsRef key) {
     return new String(chars);
   }
 
-  private boolean isSuitableRoot(IntsRef forms) {
+  private List<DictEntry> filterSuitableEntries(String word, IntsRef forms) {
+    List<DictEntry> result = new ArrayList<>();
     for (int i = 0; i < forms.length; i += dictionary.formStep()) {
       int entryId = forms.ints[forms.offset + i];
-      if (dictionary.hasFlag(entryId, dictionary.needaffix)

Review comment:
       The `needaffix` check is moved into `expandRoot`





[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572696542



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
##########
@@ -33,44 +40,59 @@
  */
 class GeneratingSuggester {
   private static final int MAX_ROOTS = 100;
-  private static final int MAX_GUESSES = 100;
+  private static final int MAX_WORDS = 100;
+  private static final int MAX_GUESSES = 200;
   private final Dictionary dictionary;
+  private final SpellChecker speller;
 
-  GeneratingSuggester(Dictionary dictionary) {
-    this.dictionary = dictionary;
+  GeneratingSuggester(SpellChecker speller) {
+    this.dictionary = speller.dictionary;
+    this.speller = speller;
   }
 
   List<String> suggest(String word, WordCase originalCase, Set<String> prevSuggestions) {
-    List<WeightedWord> roots = findSimilarDictionaryEntries(word, originalCase);
-    List<WeightedWord> expanded = expandRoots(word, roots);
-    TreeSet<WeightedWord> bySimilarity = rankBySimilarity(word, expanded);
+    List<Weighted<DictEntry>> roots = findSimilarDictionaryEntries(word, originalCase);
+    List<Weighted<String>> expanded = expandRoots(word, roots);
+    TreeSet<Weighted<String>> bySimilarity = rankBySimilarity(word, expanded);
     return getMostRelevantSuggestions(bySimilarity, prevSuggestions);
   }
 
-  private List<WeightedWord> findSimilarDictionaryEntries(String word, WordCase originalCase) {
-    try {
-      IntsRefFSTEnum<IntsRef> fstEnum = new IntsRefFSTEnum<>(dictionary.words);
-      TreeSet<WeightedWord> roots = new TreeSet<>();
+  private List<Weighted<DictEntry>> findSimilarDictionaryEntries(
+      String word, WordCase originalCase) {
+    TreeSet<Weighted<DictEntry>> roots = new TreeSet<>();
+    processFST(
+        dictionary.words,
+        (key, forms) -> {
+          if (Math.abs(key.length - word.length()) > 4) return;
+
+          String root = toString(key);
+          List<DictEntry> entries = filterSuitableEntries(root, forms);
+          if (entries.isEmpty()) return;
+
+          if (originalCase == WordCase.LOWER
+              && WordCase.caseOf(root) == WordCase.TITLE
+              && !dictionary.hasLanguage("de")) {
+            return;
+          }
 
-      IntsRefFSTEnum.InputOutput<IntsRef> mapping;
-      while ((mapping = fstEnum.next()) != null) {
-        IntsRef key = mapping.input;
-        if (Math.abs(key.length - word.length()) > 4 || !isSuitableRoot(mapping.output)) continue;
-
-        String root = toString(key);
-        if (originalCase == WordCase.LOWER
-            && WordCase.caseOf(root) == WordCase.TITLE
-            && !dictionary.hasLanguage("de")) {
-          continue;
-        }
+          String lower = dictionary.toLowerCase(root);
+          int sc =
+              ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE))
+                  + commonPrefix(word, root);
 
-        String lower = dictionary.toLowerCase(root);
-        int sc =
-            ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE)) + commonPrefix(word, root);
+          entries.forEach(e -> roots.add(new Weighted<>(e, sc)));
+        });
+    return roots.stream().limit(MAX_ROOTS).collect(Collectors.toList());
+  }
 
-        roots.add(new WeightedWord(root, sc));
+  private void processFST(FST<IntsRef> fst, BiConsumer<IntsRef, IntsRef> keyValueConsumer) {

Review comment:
       This might be worth moving to some util, e.g. `IntsRefFSTEnum`





[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572677318



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
##########
@@ -132,14 +155,105 @@ private static int calcThreshold(String word) {
     return thresh / 3 - 1;
   }
 
-  private TreeSet<WeightedWord> rankBySimilarity(String word, List<WeightedWord> expanded) {
+  private List<String> expandRoot(DictEntry root, String misspelled) {

Review comment:
       Main change here





[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572766983



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
##########
@@ -33,44 +40,59 @@
  */
 class GeneratingSuggester {
   private static final int MAX_ROOTS = 100;
-  private static final int MAX_GUESSES = 100;
+  private static final int MAX_WORDS = 100;
+  private static final int MAX_GUESSES = 200;
   private final Dictionary dictionary;
+  private final SpellChecker speller;
 
-  GeneratingSuggester(Dictionary dictionary) {
-    this.dictionary = dictionary;
+  GeneratingSuggester(SpellChecker speller) {
+    this.dictionary = speller.dictionary;
+    this.speller = speller;
   }
 
   List<String> suggest(String word, WordCase originalCase, Set<String> prevSuggestions) {
-    List<WeightedWord> roots = findSimilarDictionaryEntries(word, originalCase);
-    List<WeightedWord> expanded = expandRoots(word, roots);
-    TreeSet<WeightedWord> bySimilarity = rankBySimilarity(word, expanded);
+    List<Weighted<DictEntry>> roots = findSimilarDictionaryEntries(word, originalCase);
+    List<Weighted<String>> expanded = expandRoots(word, roots);
+    TreeSet<Weighted<String>> bySimilarity = rankBySimilarity(word, expanded);
     return getMostRelevantSuggestions(bySimilarity, prevSuggestions);
   }
 
-  private List<WeightedWord> findSimilarDictionaryEntries(String word, WordCase originalCase) {
-    try {
-      IntsRefFSTEnum<IntsRef> fstEnum = new IntsRefFSTEnum<>(dictionary.words);
-      TreeSet<WeightedWord> roots = new TreeSet<>();
+  private List<Weighted<DictEntry>> findSimilarDictionaryEntries(
+      String word, WordCase originalCase) {
+    TreeSet<Weighted<DictEntry>> roots = new TreeSet<>();
+    processFST(
+        dictionary.words,
+        (key, forms) -> {
+          if (Math.abs(key.length - word.length()) > 4) return;
+
+          String root = toString(key);
+          List<DictEntry> entries = filterSuitableEntries(root, forms);
+          if (entries.isEmpty()) return;
+
+          if (originalCase == WordCase.LOWER
+              && WordCase.caseOf(root) == WordCase.TITLE
+              && !dictionary.hasLanguage("de")) {
+            return;
+          }
 
-      IntsRefFSTEnum.InputOutput<IntsRef> mapping;
-      while ((mapping = fstEnum.next()) != null) {
-        IntsRef key = mapping.input;
-        if (Math.abs(key.length - word.length()) > 4 || !isSuitableRoot(mapping.output)) continue;
-
-        String root = toString(key);
-        if (originalCase == WordCase.LOWER
-            && WordCase.caseOf(root) == WordCase.TITLE
-            && !dictionary.hasLanguage("de")) {
-          continue;
-        }
+          String lower = dictionary.toLowerCase(root);
+          int sc =
+              ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE))
+                  + commonPrefix(word, root);
 
-        String lower = dictionary.toLowerCase(root);
-        int sc =
-            ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE)) + commonPrefix(word, root);
+          entries.forEach(e -> roots.add(new Weighted<>(e, sc)));
+        });
+    return roots.stream().limit(MAX_ROOTS).collect(Collectors.toList());
+  }
 
-        roots.add(new WeightedWord(root, sc));
+  private void processFST(FST<IntsRef> fst, BiConsumer<IntsRef, IntsRef> keyValueConsumer) {

Review comment:
       BiPredicate sounds pure to me, while this processing can have side effects. It's not in the javadoc, just in the name: predicates are something stateless.
   
   `IOException`s would be in the FST walking, the processing code itself doesn't necessarily need them (but can also have them).
   
   Maybe given all that it's just easier to leave the walking here.





[GitHub] [lucene-solr] dweiss merged pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…

Posted by GitBox <gi...@apache.org>.

dweiss merged pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330


   



[GitHub] [lucene-solr] dweiss commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…

Posted by GitBox <gi...@apache.org>.

dweiss commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572764206



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
##########
@@ -33,44 +40,59 @@
  */
 class GeneratingSuggester {
   private static final int MAX_ROOTS = 100;
-  private static final int MAX_GUESSES = 100;
+  private static final int MAX_WORDS = 100;
+  private static final int MAX_GUESSES = 200;
   private final Dictionary dictionary;
+  private final SpellChecker speller;
 
-  GeneratingSuggester(Dictionary dictionary) {
-    this.dictionary = dictionary;
+  GeneratingSuggester(SpellChecker speller) {
+    this.dictionary = speller.dictionary;
+    this.speller = speller;
   }
 
   List<String> suggest(String word, WordCase originalCase, Set<String> prevSuggestions) {
-    List<WeightedWord> roots = findSimilarDictionaryEntries(word, originalCase);
-    List<WeightedWord> expanded = expandRoots(word, roots);
-    TreeSet<WeightedWord> bySimilarity = rankBySimilarity(word, expanded);
+    List<Weighted<DictEntry>> roots = findSimilarDictionaryEntries(word, originalCase);
+    List<Weighted<String>> expanded = expandRoots(word, roots);
+    TreeSet<Weighted<String>> bySimilarity = rankBySimilarity(word, expanded);
     return getMostRelevantSuggestions(bySimilarity, prevSuggestions);
   }
 
-  private List<WeightedWord> findSimilarDictionaryEntries(String word, WordCase originalCase) {
-    try {
-      IntsRefFSTEnum<IntsRef> fstEnum = new IntsRefFSTEnum<>(dictionary.words);
-      TreeSet<WeightedWord> roots = new TreeSet<>();
+  private List<Weighted<DictEntry>> findSimilarDictionaryEntries(
+      String word, WordCase originalCase) {
+    TreeSet<Weighted<DictEntry>> roots = new TreeSet<>();
+    processFST(
+        dictionary.words,
+        (key, forms) -> {
+          if (Math.abs(key.length - word.length()) > 4) return;
+
+          String root = toString(key);
+          List<DictEntry> entries = filterSuitableEntries(root, forms);
+          if (entries.isEmpty()) return;
+
+          if (originalCase == WordCase.LOWER
+              && WordCase.caseOf(root) == WordCase.TITLE
+              && !dictionary.hasLanguage("de")) {
+            return;
+          }
 
-      IntsRefFSTEnum.InputOutput<IntsRef> mapping;
-      while ((mapping = fstEnum.next()) != null) {
-        IntsRef key = mapping.input;
-        if (Math.abs(key.length - word.length()) > 4 || !isSuitableRoot(mapping.output)) continue;
-
-        String root = toString(key);
-        if (originalCase == WordCase.LOWER
-            && WordCase.caseOf(root) == WordCase.TITLE
-            && !dictionary.hasLanguage("de")) {
-          continue;
-        }
+          String lower = dictionary.toLowerCase(root);
+          int sc =
+              ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE))
+                  + commonPrefix(word, root);
 
-        String lower = dictionary.toLowerCase(root);
-        int sc =
-            ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE)) + commonPrefix(word, root);
+          entries.forEach(e -> roots.add(new Weighted<>(e, sc)));
+        });
+    return roots.stream().limit(MAX_ROOTS).collect(Collectors.toList());
+  }
 
-        roots.add(new WeightedWord(root, sc));
+  private void processFST(FST<IntsRef> fst, BiConsumer<IntsRef, IntsRef> keyValueConsumer) {

Review comment:
       A BiPredicate sounds good to me, actually... But if IOExceptions are to be allowed then you'd need a custom visitor interface anyway.
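
To make the "custom visitor interface" idea concrete, here is a minimal sketch of a visitor that can both stop the walk early and throw `IOException`. Everything in it is an assumption for illustration: `EntryVisitor` and `walk` are hypothetical names, and an in-memory sorted map stands in for the FST so that the example runs without Lucene.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BreakableWalkSketch {
  /** Hypothetical visitor: returns false to stop the walk, and may throw IOException. */
  @FunctionalInterface
  interface EntryVisitor<K, V> {
    boolean visit(K key, V value) throws IOException;
  }

  /** Visits entries in iteration order until the visitor returns false; returns the number visited. */
  static <K, V> int walk(Map<K, V> entries, EntryVisitor<K, V> visitor) throws IOException {
    int visited = 0;
    for (Map.Entry<K, V> e : entries.entrySet()) {
      visited++;
      if (!visitor.visit(e.getKey(), e.getValue())) {
        break;
      }
    }
    return visited;
  }

  public static void main(String[] args) throws IOException {
    Map<String, Integer> words = new TreeMap<>(Map.of("apple", 1, "banana", 2, "cherry", 3));
    List<String> seen = new ArrayList<>();
    int visited = walk(words, (key, value) -> {
      seen.add(key);          // side effects are expected here, unlike in a predicate
      return seen.size() < 2; // stop after two entries
    });
    System.out.println(visited + " " + seen); // prints 2 [apple, banana]
  }
}
```

A dedicated interface sidesteps both concerns raised in this thread: its name does not suggest a stateless predicate, and the checked exception is part of its signature.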





[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572677980



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java
##########
@@ -798,8 +798,4 @@ private boolean isFlagAppendedByAffix(int affixId, char flag) {
     int appendId = dictionary.affixData(affixId, Dictionary.AFFIX_APPEND);
     return dictionary.hasFlag(appendId, flag);
   }
-
-  private boolean isCrossProduct(int affix) {

Review comment:
       moved to Dictionary





[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572676276



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
##########
@@ -33,44 +40,59 @@
  */
 class GeneratingSuggester {
   private static final int MAX_ROOTS = 100;
-  private static final int MAX_GUESSES = 100;
+  private static final int MAX_WORDS = 100;
+  private static final int MAX_GUESSES = 200;
   private final Dictionary dictionary;
+  private final SpellChecker speller;
 
-  GeneratingSuggester(Dictionary dictionary) {
-    this.dictionary = dictionary;
+  GeneratingSuggester(SpellChecker speller) {
+    this.dictionary = speller.dictionary;
+    this.speller = speller;
   }
 
   List<String> suggest(String word, WordCase originalCase, Set<String> prevSuggestions) {
-    List<WeightedWord> roots = findSimilarDictionaryEntries(word, originalCase);
-    List<WeightedWord> expanded = expandRoots(word, roots);
-    TreeSet<WeightedWord> bySimilarity = rankBySimilarity(word, expanded);
+    List<Weighted<DictEntry>> roots = findSimilarDictionaryEntries(word, originalCase);

Review comment:
       Just a renamed and parameterized `WeightedWord`
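
The body of `Weighted` is not shown in this thread, so the following is only a guess at the general shape of such a class: a generic wrapper that pairs a value with a score and sorts higher scores first, so that a `TreeSet` plus `limit` yields the best candidates. The field names, the tie-breaking rule, and the `best` helper are assumptions made for this sketch.

```java
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

class WeightedSketch {
  /** Hypothetical generic wrapper pairing a value with a score; higher scores sort first. */
  static final class Weighted<T extends Comparable<T>> implements Comparable<Weighted<T>> {
    final T value;
    final int score;

    Weighted(T value, int score) {
      this.value = value;
      this.score = score;
    }

    @Override
    public int compareTo(Weighted<T> other) {
      int byScore = Integer.compare(other.score, score); // descending by score
      // Break ties by value so a TreeSet keeps distinct values that share a score.
      return byScore != 0 ? byScore : value.compareTo(other.value);
    }
  }

  /** Returns the values of the {@code limit} highest-scored entries. */
  static <T extends Comparable<T>> List<T> best(TreeSet<Weighted<T>> ranked, int limit) {
    return ranked.stream().limit(limit).map(w -> w.value).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    TreeSet<Weighted<String>> roots = new TreeSet<>();
    roots.add(new Weighted<>("walk", 5));
    roots.add(new Weighted<>("talk", 5));
    roots.add(new Weighted<>("wall", 7));
    roots.add(new Weighted<>("milk", 1));
    System.out.println(best(roots, 3)); // prints [wall, talk, walk]
  }
}
```

The tie-break matters because a `TreeSet` treats elements that compare as equal as duplicates; comparing by score alone would silently drop one of two candidates with the same score.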





[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572760859



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
##########
@@ -33,44 +40,59 @@
  */
 class GeneratingSuggester {
   private static final int MAX_ROOTS = 100;
-  private static final int MAX_GUESSES = 100;
+  private static final int MAX_WORDS = 100;
+  private static final int MAX_GUESSES = 200;
   private final Dictionary dictionary;
+  private final SpellChecker speller;
 
-  GeneratingSuggester(Dictionary dictionary) {
-    this.dictionary = dictionary;
+  GeneratingSuggester(SpellChecker speller) {
+    this.dictionary = speller.dictionary;
+    this.speller = speller;
   }
 
   List<String> suggest(String word, WordCase originalCase, Set<String> prevSuggestions) {
-    List<WeightedWord> roots = findSimilarDictionaryEntries(word, originalCase);
-    List<WeightedWord> expanded = expandRoots(word, roots);
-    TreeSet<WeightedWord> bySimilarity = rankBySimilarity(word, expanded);
+    List<Weighted<DictEntry>> roots = findSimilarDictionaryEntries(word, originalCase);
+    List<Weighted<String>> expanded = expandRoots(word, roots);
+    TreeSet<Weighted<String>> bySimilarity = rankBySimilarity(word, expanded);
     return getMostRelevantSuggestions(bySimilarity, prevSuggestions);
   }
 
-  private List<WeightedWord> findSimilarDictionaryEntries(String word, WordCase originalCase) {
-    try {
-      IntsRefFSTEnum<IntsRef> fstEnum = new IntsRefFSTEnum<>(dictionary.words);
-      TreeSet<WeightedWord> roots = new TreeSet<>();
+  private List<Weighted<DictEntry>> findSimilarDictionaryEntries(
+      String word, WordCase originalCase) {
+    TreeSet<Weighted<DictEntry>> roots = new TreeSet<>();
+    processFST(
+        dictionary.words,
+        (key, forms) -> {
+          if (Math.abs(key.length - word.length()) > 4) return;
+
+          String root = toString(key);
+          List<DictEntry> entries = filterSuitableEntries(root, forms);
+          if (entries.isEmpty()) return;
+
+          if (originalCase == WordCase.LOWER
+              && WordCase.caseOf(root) == WordCase.TITLE
+              && !dictionary.hasLanguage("de")) {
+            return;
+          }
 
-      IntsRefFSTEnum.InputOutput<IntsRef> mapping;
-      while ((mapping = fstEnum.next()) != null) {
-        IntsRef key = mapping.input;
-        if (Math.abs(key.length - word.length()) > 4 || !isSuitableRoot(mapping.output)) continue;
-
-        String root = toString(key);
-        if (originalCase == WordCase.LOWER
-            && WordCase.caseOf(root) == WordCase.TITLE
-            && !dictionary.hasLanguage("de")) {
-          continue;
-        }
+          String lower = dictionary.toLowerCase(root);
+          int sc =
+              ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE))
+                  + commonPrefix(word, root);
 
-        String lower = dictionary.toLowerCase(root);
-        int sc =
-            ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE)) + commonPrefix(word, root);
+          entries.forEach(e -> roots.add(new Weighted<>(e, sc)));
+        });
+    return roots.stream().limit(MAX_ROOTS).collect(Collectors.toList());
+  }
 
-        roots.add(new WeightedWord(root, sc));
+  private void processFST(FST<IntsRef> fst, BiConsumer<IntsRef, IntsRef> keyValueConsumer) {

Review comment:
       I wonder if it makes sense to add something breakable in the middle, e.g. accepting some processor (unfortunately neither `BiFunction` nor `BiPredicate` conveys that semantics for me :( ). OTOH I don't need it right now, and breakability can be added later. Or, it could be made a `Stream` or `Iterable`.
   
   One complication though: here I wrap all `IOException`s, but that's probably not a good idea in a general FST case.
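
For readers unfamiliar with the pattern, wrapping `IOException`s around a `BiConsumer`-based walk might look roughly like the sketch below. The body of `processFST` is not shown in this thread, so this is an assumption about the general approach and not the PR's code: `EntryCursor`, `processAll`, and `cursorOf` are hypothetical names, and the cursor is a stand-in for an enumerator whose `next()` can throw.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

class WrappingWalkSketch {
  /** Hypothetical cursor whose next() may fail with IOException; returns null when exhausted. */
  interface EntryCursor<K, V> {
    Map.Entry<K, V> next() throws IOException;
  }

  /** Feeds every entry to the consumer, rethrowing IOException as UncheckedIOException. */
  static <K, V> void processAll(EntryCursor<K, V> cursor, BiConsumer<K, V> consumer) {
    try {
      Map.Entry<K, V> entry;
      while ((entry = cursor.next()) != null) {
        consumer.accept(entry.getKey(), entry.getValue());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e); // the original exception stays available as the cause
    }
  }

  /** Adapts an in-memory iterator to the cursor interface for demonstration. */
  static <K, V> EntryCursor<K, V> cursorOf(Iterator<Map.Entry<K, V>> it) {
    return () -> it.hasNext() ? it.next() : null;
  }

  public static void main(String[] args) {
    List<String> keys = new ArrayList<>();
    processAll(
        cursorOf(List.of(Map.entry("a", 1), Map.entry("b", 2)).iterator()),
        (key, value) -> keys.add(key));
    System.out.println(keys); // prints [a, b]

    try {
      WrappingWalkSketch.<String, Integer>processAll(
          () -> { throw new IOException("disk error"); }, (k, v) -> {});
    } catch (UncheckedIOException e) {
      System.out.println("wrapped: " + e.getCause().getMessage()); // prints wrapped: disk error
    }
  }
}
```

This keeps the caller's lambda free of checked exceptions, at the cost of hiding I/O failures behind an unchecked type, which is the trade-off questioned above for a general-purpose FST utility.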





[GitHub] [lucene-solr] dweiss commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…

Posted by GitBox <gi...@apache.org>.

dweiss commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572744570



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
##########
@@ -33,44 +40,59 @@
  */
 class GeneratingSuggester {
   private static final int MAX_ROOTS = 100;
-  private static final int MAX_GUESSES = 100;
+  private static final int MAX_WORDS = 100;
+  private static final int MAX_GUESSES = 200;
   private final Dictionary dictionary;
+  private final SpellChecker speller;
 
-  GeneratingSuggester(Dictionary dictionary) {
-    this.dictionary = dictionary;
+  GeneratingSuggester(SpellChecker speller) {
+    this.dictionary = speller.dictionary;
+    this.speller = speller;
   }
 
   List<String> suggest(String word, WordCase originalCase, Set<String> prevSuggestions) {
-    List<WeightedWord> roots = findSimilarDictionaryEntries(word, originalCase);
-    List<WeightedWord> expanded = expandRoots(word, roots);
-    TreeSet<WeightedWord> bySimilarity = rankBySimilarity(word, expanded);
+    List<Weighted<DictEntry>> roots = findSimilarDictionaryEntries(word, originalCase);
+    List<Weighted<String>> expanded = expandRoots(word, roots);
+    TreeSet<Weighted<String>> bySimilarity = rankBySimilarity(word, expanded);
     return getMostRelevantSuggestions(bySimilarity, prevSuggestions);
   }
 
-  private List<WeightedWord> findSimilarDictionaryEntries(String word, WordCase originalCase) {
-    try {
-      IntsRefFSTEnum<IntsRef> fstEnum = new IntsRefFSTEnum<>(dictionary.words);
-      TreeSet<WeightedWord> roots = new TreeSet<>();
+  private List<Weighted<DictEntry>> findSimilarDictionaryEntries(
+      String word, WordCase originalCase) {
+    TreeSet<Weighted<DictEntry>> roots = new TreeSet<>();
+    processFST(
+        dictionary.words,
+        (key, forms) -> {
+          if (Math.abs(key.length - word.length()) > 4) return;
+
+          String root = toString(key);
+          List<DictEntry> entries = filterSuitableEntries(root, forms);
+          if (entries.isEmpty()) return;
+
+          if (originalCase == WordCase.LOWER
+              && WordCase.caseOf(root) == WordCase.TITLE
+              && !dictionary.hasLanguage("de")) {
+            return;
+          }
 
-      IntsRefFSTEnum.InputOutput<IntsRef> mapping;
-      while ((mapping = fstEnum.next()) != null) {
-        IntsRef key = mapping.input;
-        if (Math.abs(key.length - word.length()) > 4 || !isSuitableRoot(mapping.output)) continue;
-
-        String root = toString(key);
-        if (originalCase == WordCase.LOWER
-            && WordCase.caseOf(root) == WordCase.TITLE
-            && !dictionary.hasLanguage("de")) {
-          continue;
-        }
+          String lower = dictionary.toLowerCase(root);
+          int sc =
+              ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE))
+                  + commonPrefix(word, root);
 
-        String lower = dictionary.toLowerCase(root);
-        int sc =
-            ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE)) + commonPrefix(word, root);
+          entries.forEach(e -> roots.add(new Weighted<>(e, sc)));
+        });
+    return roots.stream().limit(MAX_ROOTS).collect(Collectors.toList());
+  }
 
-        roots.add(new WeightedWord(root, sc));
+  private void processFST(FST<IntsRef> fst, BiConsumer<IntsRef, IntsRef> keyValueConsumer) {

Review comment:
       Add a `forEach` method to the FST enum, maybe? It would then correspond to Java collections.
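
   A rough sketch of what such a `forEach` could look like, under stated assumptions: `PairEnum` below is a hypothetical stand-in for an enum whose `next()` returns null at the end and may throw `IOException`, and `IOBiConsumer` is an illustrative functional interface, not an existing Lucene or JDK type. Unlike `Map.forEach`, the action is allowed to throw `IOException`, so nothing has to be wrapped inside the enum.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.Map;

class FstEnumForEach {
  /** Hypothetical BiConsumer variant whose accept() may throw IOException. */
  @FunctionalInterface
  interface IOBiConsumer<K, V> {
    void accept(K key, V value) throws IOException;
  }

  /** Stand-in for an FST enum (an assumption, not a Lucene type). */
  @FunctionalInterface
  interface PairEnum<K, V> {
    /** Returns the next entry, or null when the enumeration is exhausted. */
    Map.Entry<K, V> next() throws IOException;

    /** Applies the action to every remaining entry, in enumeration order. */
    default void forEach(IOBiConsumer<? super K, ? super V> action) throws IOException {
      Map.Entry<K, V> entry;
      while ((entry = next()) != null) {
        action.accept(entry.getKey(), entry.getValue());
      }
    }
  }

  /** Adapts an in-memory map to the stand-in enum, for demonstration only. */
  static <K, V> PairEnum<K, V> over(Map<K, V> map) {
    Iterator<Map.Entry<K, V>> it = map.entrySet().iterator();
    return () -> it.hasNext() ? it.next() : null;
  }

  /** Example caller: sums all values, wrapping IOException at the call site. */
  static int sumValues(Map<String, Integer> map) {
    int[] sum = {0};
    try {
      over(map).forEach((key, value) -> sum[0] += value);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sum[0];
  }
}
```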





[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…

Posted by GitBox <gi...@apache.org>.

donnerpeter commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572676594



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
##########
@@ -33,44 +40,59 @@
  */
 class GeneratingSuggester {
   private static final int MAX_ROOTS = 100;
-  private static final int MAX_GUESSES = 100;
+  private static final int MAX_WORDS = 100;
+  private static final int MAX_GUESSES = 200;
   private final Dictionary dictionary;
+  private final SpellChecker speller;
 
-  GeneratingSuggester(Dictionary dictionary) {
-    this.dictionary = dictionary;
+  GeneratingSuggester(SpellChecker speller) {
+    this.dictionary = speller.dictionary;
+    this.speller = speller;
   }
 
   List<String> suggest(String word, WordCase originalCase, Set<String> prevSuggestions) {
-    List<WeightedWord> roots = findSimilarDictionaryEntries(word, originalCase);
-    List<WeightedWord> expanded = expandRoots(word, roots);
-    TreeSet<WeightedWord> bySimilarity = rankBySimilarity(word, expanded);
+    List<Weighted<DictEntry>> roots = findSimilarDictionaryEntries(word, originalCase);
+    List<Weighted<String>> expanded = expandRoots(word, roots);
+    TreeSet<Weighted<String>> bySimilarity = rankBySimilarity(word, expanded);
     return getMostRelevantSuggestions(bySimilarity, prevSuggestions);
   }
 
-  private List<WeightedWord> findSimilarDictionaryEntries(String word, WordCase originalCase) {
-    try {
-      IntsRefFSTEnum<IntsRef> fstEnum = new IntsRefFSTEnum<>(dictionary.words);
-      TreeSet<WeightedWord> roots = new TreeSet<>();
+  private List<Weighted<DictEntry>> findSimilarDictionaryEntries(
+      String word, WordCase originalCase) {
+    TreeSet<Weighted<DictEntry>> roots = new TreeSet<>();
+    processFST(

Review comment:
       Extracted FST traversal into a separate method.



