You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/01/28 08:37:08 UTC

[GitHub] [lucene-solr] donnerpeter opened a new pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

donnerpeter opened a new pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260


   <!--
   _(If you are a project committer then you may remove some/all of the following template.)_
   
   Before creating a pull request, please file an issue in the ASF Jira system for Lucene or Solr:
   
   * https://issues.apache.org/jira/projects/LUCENE
   * https://issues.apache.org/jira/projects/SOLR
   
   You will need to create an account in Jira in order to create an issue.
   
   The title of the PR should reference the Jira issue number in the form:
   
   * LUCENE-####: <short description of problem or changes>
   * SOLR-####: <short description of problem or changes>
   
   LUCENE and SOLR must be fully capitalized. A short description helps people scanning pull requests for items they can work on.
   
   Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. -->
   
   
   # Description
   
   Capitalized ß is SS and some SS's decapitalize to ß (but only some)
   
   # Solution
   
   When decapitalizing, try replacing all reasonable subsets of SS occurrences with ß.
   
   # Tests
   
   `checksharps` from Hunspell/C++
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dweiss commented on a change in pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

Posted by GitBox <gi...@apache.org>.
dweiss commented on a change in pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260#discussion_r566672934



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java
##########
@@ -158,14 +173,52 @@ WordCase caseOf(char[] word, int length) {
     return null;
   }
 
+  List<char[]> sharpSVariations(char[] word, int length) {
+    if (!dictionary.checkSharpS) return Collections.emptyList();
+
+    Stream<String> result =
+        new Object() {
+          int findSS(int start) {
+            for (int i = start; i < length - 1; i++) {
+              if (word[i] == 's' && word[i + 1] == 's') {

Review comment:
       Thanks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

Posted by GitBox <gi...@apache.org>.
donnerpeter commented on a change in pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260#discussion_r565907637



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/SpellChecker.java
##########
@@ -71,23 +71,39 @@ private boolean checkCaseVariants(char[] wordChars, WordCase wordCase) {
     char[] caseVariant = wordChars;
     if (wordCase == WordCase.UPPER) {
       caseVariant = stemmer.caseFoldTitle(caseVariant, wordChars.length);
-      if (checkWord(caseVariant, wordChars.length, true)) {
+      if (checkWord(caseVariant, wordChars.length, wordCase)) {
         return true;
       }
       char[] aposCase = Stemmer.capitalizeAfterApostrophe(caseVariant, wordChars.length);
-      if (aposCase != null && checkWord(aposCase, aposCase.length, true)) {
+      if (aposCase != null && checkWord(aposCase, aposCase.length, wordCase)) {
         return true;
       }
+      for (char[] variation : stemmer.sharpSVariations(caseVariant, wordChars.length)) {

Review comment:
       The duplication with `Stemmer`'s case analysis bugs me, and I'm thinking of ways to avoid it in some future PRs




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dweiss commented on a change in pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

Posted by GitBox <gi...@apache.org>.
dweiss commented on a change in pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260#discussion_r566626601



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java
##########
@@ -158,14 +173,52 @@ WordCase caseOf(char[] word, int length) {
     return null;
   }
 
+  List<char[]> sharpSVariations(char[] word, int length) {
+    if (!dictionary.checkSharpS) return Collections.emptyList();
+
+    Stream<String> result =

Review comment:
       Do you think it makes sense to use language trickery (anonymous subclass of Object) just to hide those two methods? I see it's a recursive call to replaceSS but I'd just move it to static methods at the top class level and let them be used directly from there?

##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java
##########
@@ -158,14 +173,52 @@ WordCase caseOf(char[] word, int length) {
     return null;
   }
 
+  List<char[]> sharpSVariations(char[] word, int length) {
+    if (!dictionary.checkSharpS) return Collections.emptyList();
+
+    Stream<String> result =
+        new Object() {
+          int findSS(int start) {
+            for (int i = start; i < length - 1; i++) {
+              if (word[i] == 's' && word[i + 1] == 's') {

Review comment:
       I don't think there is. There is an ancient discussion somewhere on jdk's mailing lists about making String.indexOf(String) more algorithmically efficient but it never made it through - it's hard to beat the naive algorithm on average, common case and if you have to deal with degenerates, it's advised to just roll out your own version.

##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordCase.java
##########
@@ -58,4 +58,20 @@ private static WordCase get(boolean startsWithLower, boolean seenUpper, boolean
     }
     return seenUpper ? MIXED : LOWER;
   }
+
+  private static CharCase charCase(char c) {
+    if (Character.isUpperCase(c)) {
+      return CharCase.UPPER;
+    }
+    if (Character.isLowerCase(c) && Character.toUpperCase(c) != c) {
+      return CharCase.LOWER;
+    }
+    return CharCase.NEUTRAL;

Review comment:
       Ah, you learn things every day...

##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java
##########
@@ -158,14 +173,52 @@ WordCase caseOf(char[] word, int length) {
     return null;
   }
 
+  List<char[]> sharpSVariations(char[] word, int length) {
+    if (!dictionary.checkSharpS) return Collections.emptyList();
+
+    Stream<String> result =

Review comment:
       I also wonder if all that stream trickery is worth it in favor of a simple collector list pushed down recursively... Streams are nice, but a list requires less thinking (to me).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dweiss merged pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

Posted by GitBox <gi...@apache.org>.
dweiss merged pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dweiss commented on a change in pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

Posted by GitBox <gi...@apache.org>.
dweiss commented on a change in pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260#discussion_r566654706



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java
##########
@@ -158,14 +173,52 @@ WordCase caseOf(char[] word, int length) {
     return null;
   }
 
+  List<char[]> sharpSVariations(char[] word, int length) {
+    if (!dictionary.checkSharpS) return Collections.emptyList();
+
+    Stream<String> result =

Review comment:
       I understand the motives. But I tend to believe 25% of Java developers who see this construct will scratch their heads thinking what it does. :)
   
   Once you start using streams, memory allocation is no longer a concern... (static methods and argument passing would definitely not allocate more). 
   
   It's fine, leave it. 
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

Posted by GitBox <gi...@apache.org>.
donnerpeter commented on a change in pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260#discussion_r566639675



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java
##########
@@ -158,14 +173,52 @@ WordCase caseOf(char[] word, int length) {
     return null;
   }
 
+  List<char[]> sharpSVariations(char[] word, int length) {
+    if (!dictionary.checkSharpS) return Collections.emptyList();
+
+    Stream<String> result =

Review comment:
       The anonymous class serves two purposes: avoiding auxiliary methods polluting class namespace and avoiding passing `char[] word, int length` in them. If you think the verbosity isn't worth that, I can make the methods static.
   
   Streams actually seem more natural to me for such a recursive enumeration. I can do the same with lists, but that will probably be a bit more verbose (and probably allocate a bit more, although that shouldn't matter much here). Should I?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] donnerpeter commented on pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

Posted by GitBox <gi...@apache.org>.
donnerpeter commented on pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260#issuecomment-769661902


   Resolved merge conflicts with master


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

Posted by GitBox <gi...@apache.org>.
donnerpeter commented on a change in pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260#discussion_r566635372



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java
##########
@@ -158,14 +173,52 @@ WordCase caseOf(char[] word, int length) {
     return null;
   }
 
+  List<char[]> sharpSVariations(char[] word, int length) {
+    if (!dictionary.checkSharpS) return Collections.emptyList();
+
+    Stream<String> result =
+        new Object() {
+          int findSS(int start) {
+            for (int i = start; i < length - 1; i++) {
+              if (word[i] == 's' && word[i + 1] == 's') {

Review comment:
       If it were a `String`, I'd use `indexOf`, but it's a `char[]`. Pity :(




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

Posted by GitBox <gi...@apache.org>.
donnerpeter commented on a change in pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260#discussion_r565908473



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java
##########
@@ -198,17 +251,35 @@ WordCase caseOf(char[] word, int length) {
               true,
               false,
               false,
-              caseVariant));
+              originalCase));
     } catch (IOException bogus) {
       throw new RuntimeException(bogus);
     }
     return stems;
   }
 
-  private boolean acceptCase(boolean caseVariant, char[] wordFlags) {
-    return caseVariant
-        ? !Dictionary.hasFlag(wordFlags, dictionary.keepcase)
-        : !Dictionary.hasHiddenFlag(wordFlags);
+  private boolean acceptCase(
+      WordCase originalCase, char[] wordFlags, char[] word, int offset, int length) {
+    boolean keepCase = Dictionary.hasFlag(wordFlags, dictionary.keepcase);
+    if (originalCase != null) {
+      if (keepCase
+          && dictionary.checkSharpS
+          && originalCase == WordCase.TITLE
+          && containsSharpS(word, offset, length)) {
+        return true;
+      }
+      return !keepCase;
+    }
+    return !Dictionary.hasHiddenFlag(wordFlags);
+  }
+
+  private boolean containsSharpS(char[] word, int offset, int length) {
+    for (int i = 0; i < length; i++) {
+      if (word[i + offset] == 'ß') {

Review comment:
       ...or `ArrayUtil.indexOf(char[] array, int start, int end, char toFind)`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

Posted by GitBox <gi...@apache.org>.
donnerpeter commented on a change in pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260#discussion_r566665927



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java
##########
@@ -158,14 +173,52 @@ WordCase caseOf(char[] word, int length) {
     return null;
   }
 
+  List<char[]> sharpSVariations(char[] word, int length) {
+    if (!dictionary.checkSharpS) return Collections.emptyList();
+
+    Stream<String> result =
+        new Object() {
+          int findSS(int start) {
+            for (int i = start; i < length - 1; i++) {
+              if (word[i] == 's' && word[i + 1] == 's') {

Review comment:
       Yes, https://github.com/hunspell/hunspell/blob/master/src/hunspell/hunspell.hxx#L85




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

Posted by GitBox <gi...@apache.org>.
donnerpeter commented on a change in pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260#discussion_r565908763



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordCase.java
##########
@@ -58,4 +58,20 @@ private static WordCase get(boolean startsWithLower, boolean seenUpper, boolean
     }
     return seenUpper ? MIXED : LOWER;
   }
+
+  private static CharCase charCase(char c) {
+    if (Character.isUpperCase(c)) {
+      return CharCase.UPPER;
+    }
+    if (Character.isLowerCase(c) && Character.toUpperCase(c) != c) {
+      return CharCase.LOWER;
+    }
+    return CharCase.NEUTRAL;

Review comment:
       Eszett is neutral




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dweiss commented on a change in pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

Posted by GitBox <gi...@apache.org>.
dweiss commented on a change in pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260#discussion_r566662360



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java
##########
@@ -158,14 +173,52 @@ WordCase caseOf(char[] word, int length) {
     return null;
   }
 
+  List<char[]> sharpSVariations(char[] word, int length) {
+    if (!dictionary.checkSharpS) return Collections.emptyList();
+
+    Stream<String> result =
+        new Object() {
+          int findSS(int start) {
+            for (int i = start; i < length - 1; i++) {
+              if (word[i] == 's' && word[i + 1] == 's') {

Review comment:
       I like the exponential candidate expansion guard limit (5) -- is this something empirical? I know it's German and its compounds make things more difficult but 5 double 's' still seems like a large limit to me. Is this from hunspell code too?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

Posted by GitBox <gi...@apache.org>.
donnerpeter commented on a change in pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260#discussion_r566665927



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java
##########
@@ -158,14 +173,52 @@ WordCase caseOf(char[] word, int length) {
     return null;
   }
 
+  List<char[]> sharpSVariations(char[] word, int length) {
+    if (!dictionary.checkSharpS) return Collections.emptyList();
+
+    Stream<String> result =
+        new Object() {
+          int findSS(int start) {
+            for (int i = start; i < length - 1; i++) {
+              if (word[i] == 's' && word[i + 1] == 's') {

Review comment:
       Yes, https://github.com/hunspell/hunspell/blob/master/src/hunspell/hunspell.hxx#L85 and 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

Posted by GitBox <gi...@apache.org>.
donnerpeter commented on a change in pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260#discussion_r566657944



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java
##########
@@ -158,14 +173,52 @@ WordCase caseOf(char[] word, int length) {
     return null;
   }
 
+  List<char[]> sharpSVariations(char[] word, int length) {
+    if (!dictionary.checkSharpS) return Collections.emptyList();
+
+    Stream<String> result =

Review comment:
       They might. But that might be educating and doesn't seem too elaborate :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2260: LUCENE-9704: Hunspell: support capitalization for German ß

Posted by GitBox <gi...@apache.org>.
donnerpeter commented on a change in pull request #2260:
URL: https://github.com/apache/lucene-solr/pull/2260#discussion_r565908066



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java
##########
@@ -158,14 +173,52 @@ WordCase caseOf(char[] word, int length) {
     return null;
   }
 
+  List<char[]> sharpSVariations(char[] word, int length) {
+    if (!dictionary.checkSharpS) return Collections.emptyList();
+
+    Stream<String> result =
+        new Object() {
+          int findSS(int start) {
+            for (int i = start; i < length - 1; i++) {
+              if (word[i] == 's' && word[i + 1] == 's') {

Review comment:
       Is there anything like `ArrayUtil.indexOf(char[] haystack, char[] needle)`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org