Posted to commits@lucene.apache.org by ji...@apache.org on 2018/10/11 12:56:25 UTC

lucene-solr:master: LUCENE-8526: Add javadocs in CJKBigramFilter explaining the behavior of the StandardTokenizer on Hangul syllables.

Repository: lucene-solr
Updated Branches:
  refs/heads/master 971a0e3f4 -> c87778c50


LUCENE-8526: Add javadocs in CJKBigramFilter explaining the behavior of the StandardTokenizer on Hangul syllables.


Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/c87778c5
Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/c87778c5
Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/c87778c5

Branch: refs/heads/master
Commit: c87778c50472ab81c6bfae7a5371f36a105544b3
Parents: 971a0e3
Author: Jim Ferenczi <ji...@apache.org>
Authored: Thu Oct 11 13:49:14 2018 +0100
Committer: Jim Ferenczi <ji...@apache.org>
Committed: Thu Oct 11 13:49:14 2018 +0100

----------------------------------------------------------------------
 .../java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java | 8 ++++++++
 1 file changed, 8 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/c87778c5/lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java
----------------------------------------------------------------------
diff --git a/lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java b/lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java
index bf4f621..7d79b84 100644
--- a/lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java
+++ b/lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java
@@ -43,6 +43,14 @@ import org.apache.lucene.util.ArrayUtil;
  * flag in {@link CJKBigramFilter#CJKBigramFilter(TokenStream, int, boolean)}.
  * This can be used for a combined unigram+bigram approach.
  * <p>
+ * Unlike ICUTokenizer, StandardTokenizer does not split at script boundaries.
+ * Korean Hangul characters are treated the same as many other scripts'
+ * letters, and as a result, StandardTokenizer can produce tokens that mix
+ * Hangul and non-Hangul characters, e.g. "한국abc".  Such mixed-script tokens
+ * are typed as <code>&lt;ALPHANUM&gt;</code> rather than
+ * <code>&lt;HANGUL&gt;</code>, and as a result, will not be converted to
+ * bigrams by CJKBigramFilter.
+ *
  * In all cases, all non-CJK input is passed thru unmodified.
  */
 public final class CJKBigramFilter extends TokenFilter {
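
The behavior described in the added javadoc can be illustrated with a small plain-Java sketch. It does not call Lucene at all: tokenType and filter below are hypothetical helpers that only model the two rules stated above (a token made entirely of Hangul is typed <HANGUL>, a mixed-script token is typed <ALPHANUM>, and only the former is turned into bigrams). The real StandardTokenizer and CJKBigramFilter handle more token types and flags than this simplified model does.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the javadoc's description; NOT the Lucene implementation.
final class HangulBigramSketch {

    // Hypothetical stand-in for the token type StandardTokenizer would assign.
    // Assumption: only two types are modeled, <HANGUL> and <ALPHANUM>.
    static String tokenType(String token) {
        boolean allHangul = !token.isEmpty() && token.codePoints()
            .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HANGUL);
        return allHangul ? "<HANGUL>" : "<ALPHANUM>";
    }

    // Hypothetical stand-in for the filter: overlapping bigrams for <HANGUL>
    // tokens of two or more characters, everything else passed through as-is.
    static List<String> filter(String token) {
        List<String> out = new ArrayList<>();
        int[] cps = token.codePoints().toArray();
        if (!tokenType(token).equals("<HANGUL>") || cps.length < 2) {
            out.add(token);
            return out;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2)); // two adjacent code points
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(filter("한국어"));   // [한국, 국어]
        System.out.println(filter("한국abc")); // [한국abc]
    }
}
```

In this model the mixed-script token "한국abc" comes out unchanged, which mirrors the caveat in the javadoc: if Hangul bigrams are needed for such input, the tokens have to be split at script boundaries before they reach the filter, for example by using ICUTokenizer as the javadoc suggests.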