Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/06/13 15:01:02 UTC

[GitHub] [lucene-solr] johtani opened a new pull request #1577: LUCENE-9390: JapaneseTokenizer discards token that is all punctuation characters only

johtani opened a new pull request #1577:
URL: https://github.com/apache/lucene-solr/pull/1577


   # Description
   
   Check for and omit tokens that consist entirely of punctuation characters when the discard punctuation flag is true.
   Currently, JapaneseTokenizer checks only the first character of a token when deciding whether to discard it as punctuation.
   
   # Solution
   
   Add an `isAllPunctuation` method for testing tokens.
   
   # Tests
   
   Ensure that a token is discarded if it consists entirely of punctuation characters.
   Ensure that a token is not discarded if it starts with a punctuation character but also contains non-punctuation characters.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `ant precommit` and the appropriate test suite.
   - [x] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] jimczi commented on a change in pull request #1577: LUCENE-9390: JapaneseTokenizer discards token that is all punctuation characters only

Posted by GitBox <gi...@apache.org>.

jimczi commented on a change in pull request #1577:
URL: https://github.com/apache/lucene-solr/pull/1577#discussion_r440654020



##########
File path: lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java
##########
@@ -1917,4 +1917,15 @@ private static boolean isPunctuation(char ch) {
         return false;
     }
   }
+
+  private static boolean isAllCharPunctuation(char[] ch, int offset, int length) {
+    boolean flag = true;
+    for (int i = offset; i < offset + length; i++) {
+      if (!isPunctuation(ch[i])) {
+        flag = false;
+        break;
+      }
+    }
+    return flag;

Review comment:
       return `true` ?

##########
File path: lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java
##########
@@ -1917,4 +1917,15 @@ private static boolean isPunctuation(char ch) {
         return false;
     }
   }
+
+  private static boolean isAllCharPunctuation(char[] ch, int offset, int length) {
+    boolean flag = true;
+    for (int i = offset; i < offset + length; i++) {
+      if (!isPunctuation(ch[i])) {
+        flag = false;

Review comment:
       nit: you can return `false` directly ?
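
For illustration, the two review comments above could be combined into an early-return version along the following lines. This is only a sketch: the class name `PunctuationSketch` is hypothetical, and `isPunctuation` here is a simplified stand-in based on Unicode punctuation categories, not the actual `JapaneseTokenizer.isPunctuation`, which may cover more character types.

```java
final class PunctuationSketch {

  // Simplified stand-in (an assumption, not the Lucene source): treats only
  // the Unicode punctuation categories as punctuation.
  static boolean isPunctuation(char ch) {
    switch (Character.getType(ch)) {
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.CONNECTOR_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }

  // Same signature as in the diff, without the flag variable: return false
  // at the first non-punctuation character, otherwise return true.
  static boolean isAllCharPunctuation(char[] ch, int offset, int length) {
    for (int i = offset; i < offset + length; i++) {
      if (!isPunctuation(ch[i])) {
        return false;
      }
    }
    return true;
  }
}
```

With this shape the loop stops at the first non-punctuation character, and an empty range (`length == 0`) yields `true`, the same result the flag-based version gives.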





[GitHub] [lucene-solr] johtani commented on a change in pull request #1577: LUCENE-9390: JapaneseTokenizer discards token that is all punctuation characters only

Posted by GitBox <gi...@apache.org>.

johtani commented on a change in pull request #1577:
URL: https://github.com/apache/lucene-solr/pull/1577#discussion_r441647603



##########
File path: lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java
##########
@@ -1917,4 +1917,15 @@ private static boolean isPunctuation(char ch) {
         return false;
     }
   }
+
+  private static boolean isAllCharPunctuation(char[] ch, int offset, int length) {
+    boolean flag = true;
+    for (int i = offset; i < offset + length; i++) {
+      if (!isPunctuation(ch[i])) {
+        flag = false;
+        break;
+      }
+    }
+    return flag;

Review comment:
       Fixed this.

##########
File path: lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java
##########
@@ -1917,4 +1917,15 @@ private static boolean isPunctuation(char ch) {
         return false;
     }
   }
+
+  private static boolean isAllCharPunctuation(char[] ch, int offset, int length) {
+    boolean flag = true;
+    for (int i = offset; i < offset + length; i++) {
+      if (!isPunctuation(ch[i])) {
+        flag = false;

Review comment:
       Fixed this.





[GitHub] [lucene-solr] johtani commented on pull request #1577: LUCENE-9390: JapaneseTokenizer discards token that is all punctuation characters only

Posted by GitBox <gi...@apache.org>.

johtani commented on pull request #1577:
URL: https://github.com/apache/lucene-solr/pull/1577#issuecomment-645467080


   I added an NBest test case, and I also changed registerNode.
   However, the results are the same whether or not registerNode is changed...
   Am I missing a test case?
   
   For the NBest test case with discard punctuation, the tokenizer outputs a complicated token stream, so [I set `graphOffsetsAreCorrect` to `false`](https://github.com/apache/lucene-solr/blob/abf243c5cec331ec8419f0fd7c966dbce45f6b2d/lucene/analysis/kuromoji/src/test/org/apache/lucene/analysis/ja/TestJapaneseTokenizer.java#L967).

