Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/03/21 09:43:57 UTC

[GitHub] [lucene] mocobeta opened a new pull request #26: LUCENE-9853: Use CJKWidthCharFilter as the default character width normalizer in JapaneseAnalyzer

mocobeta opened a new pull request #26:
URL: https://github.com/apache/lucene/pull/26


   https://issues.apache.org/jira/browse/LUCENE-9853


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on a change in pull request #26: LUCENE-9853: Use CJKWidthCharFilter as the default character width normalizer in JapaneseAnalyzer

Posted by GitBox <gi...@apache.org>.

rmuir commented on a change in pull request #26:
URL: https://github.com/apache/lucene/pull/26#discussion_r598268124



##########
File path: lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseAnalyzer.java
##########
@@ -39,21 +41,28 @@
   private final Mode mode;
   private final Set<String> stoptags;
   private final UserDictionary userDict;
+  private final boolean charNormalization;
 
   public JapaneseAnalyzer() {
     this(
         null,
         JapaneseTokenizer.DEFAULT_MODE,
         DefaultSetHolder.DEFAULT_STOP_SET,
-        DefaultSetHolder.DEFAULT_STOP_TAGS);
+        DefaultSetHolder.DEFAULT_STOP_TAGS,
+        true);
   }
 
   public JapaneseAnalyzer(
-      UserDictionary userDict, Mode mode, CharArraySet stopwords, Set<String> stoptags) {
+      UserDictionary userDict,
+      Mode mode,
+      CharArraySet stopwords,
+      Set<String> stoptags,
+      boolean charNormalization) {

Review comment:
       I think this is a bit confusing: if set to `false`, character normalization is still performed, just at a different place in the chain. 
   
   Do we really need this parameter? I think it would be better to document it well in CHANGES.txt. If users want different behavior, they can easily build an Analyzer from the individual components.
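
   Editorial illustration, not part of the original review comment: the point that normalization still happens, only at a different place in the chain, can be sketched with plain JDK code. This is a toy model and makes several assumptions: `WidthOrderDemo`, `normalizeWidth`, and `tokenize` are hypothetical names, `java.text.Normalizer` with NFKC stands in for width normalization, and a split on ASCII spaces stands in for the tokenizer. It does not use Lucene classes and does not reproduce the behavior of `JapaneseTokenizer`.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical names): compares normalizing before vs. after tokenization.
final class WidthOrderDemo {

  // Stand-in for width normalization: NFKC folds full-width forms to their ASCII equivalents.
  static String normalizeWidth(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFKC);
  }

  // Stand-in tokenizer: splits on ASCII spaces only, so it does not see U+3000.
  static List<String> tokenize(String s) {
    List<String> tokens = new ArrayList<>();
    for (String t : s.split(" ")) {
      if (!t.isEmpty()) {
        tokens.add(t);
      }
    }
    return tokens;
  }

  // Normalization placed before the tokenizer (char-filter position).
  static List<String> normalizeThenTokenize(String text) {
    return tokenize(normalizeWidth(text));
  }

  // Normalization placed after the tokenizer (token-filter position).
  static List<String> tokenizeThenNormalize(String text) {
    List<String> out = new ArrayList<>();
    for (String t : tokenize(text)) {
      out.add(normalizeWidth(t));
    }
    return out;
  }

  public static void main(String[] args) {
    // Full-width "Lucene", an ideographic space (U+3000), and a full-width digit 9.
    String text = "\uFF2C\uFF55\uFF43\uFF45\uFF4E\uFF45\u3000\uFF19";
    System.out.println(normalizeThenTokenize(text)); // [Lucene, 9]
    System.out.println(tokenizeThenNormalize(text)); // [Lucene 9]
  }
}
```

   In both orderings every character ends up normalized; what changes is the text the tokenizer sees, and therefore where the token boundaries fall.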





[GitHub] [lucene] mocobeta commented on pull request #26: LUCENE-9853: Use CJKWidthCharFilter as the default character width normalizer in JapaneseAnalyzer

Posted by GitBox <gi...@apache.org>.

mocobeta commented on pull request #26:
URL: https://github.com/apache/lucene/pull/26#issuecomment-807727260


   @rmuir Thank you for reviewing.



[GitHub] [lucene] mocobeta merged pull request #26: LUCENE-9853: Use CJKWidthCharFilter as the default character width normalizer in JapaneseAnalyzer

Posted by GitBox <gi...@apache.org>.

mocobeta merged pull request #26:
URL: https://github.com/apache/lucene/pull/26


   



[GitHub] [lucene] mocobeta commented on a change in pull request #26: LUCENE-9853: Use CJKWidthCharFilter as the default character width normalizer in JapaneseAnalyzer

Posted by GitBox <gi...@apache.org>.

mocobeta commented on a change in pull request #26:
URL: https://github.com/apache/lucene/pull/26#discussion_r598772602



##########
File path: lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseAnalyzer.java
##########
@@ -39,21 +41,28 @@
   private final Mode mode;
   private final Set<String> stoptags;
   private final UserDictionary userDict;
+  private final boolean charNormalization;
 
   public JapaneseAnalyzer() {
     this(
         null,
         JapaneseTokenizer.DEFAULT_MODE,
         DefaultSetHolder.DEFAULT_STOP_SET,
-        DefaultSetHolder.DEFAULT_STOP_TAGS);
+        DefaultSetHolder.DEFAULT_STOP_TAGS,
+        true);
   }
 
   public JapaneseAnalyzer(
-      UserDictionary userDict, Mode mode, CharArraySet stopwords, Set<String> stoptags) {
+      UserDictionary userDict,
+      Mode mode,
+      CharArraySet stopwords,
+      Set<String> stoptags,
+      boolean charNormalization) {

Review comment:
       I thought it would be better to provide this option for backward compatibility. But yes, users can easily switch to their own custom analyzer if they want. I'll remove the parameter and add some documentation (a MIGRATE entry) on how to switch back to the old behaviour.



