Posted to commits@lucene.apache.org by ji...@apache.org on 2018/10/11 12:56:25 UTC
lucene-solr:master: LUCENE-8526: Add javadocs in CJKBigramFilter
explaining the behavior of the StandardTokenizer on Hangul syllables.
Repository: lucene-solr
Updated Branches:
refs/heads/master 971a0e3f4 -> c87778c50
LUCENE-8526: Add javadocs in CJKBigramFilter explaining the behavior of the StandardTokenizer on Hangul syllables.
Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/c87778c5
Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/c87778c5
Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/c87778c5
Branch: refs/heads/master
Commit: c87778c50472ab81c6bfae7a5371f36a105544b3
Parents: 971a0e3
Author: Jim Ferenczi <ji...@apache.org>
Authored: Thu Oct 11 13:49:14 2018 +0100
Committer: Jim Ferenczi <ji...@apache.org>
Committed: Thu Oct 11 13:49:14 2018 +0100
----------------------------------------------------------------------
.../java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java | 8 ++++++++
1 file changed, 8 insertions(+)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/c87778c5/lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java
----------------------------------------------------------------------
diff --git a/lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java b/lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java
index bf4f621..7d79b84 100644
--- a/lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java
+++ b/lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java
@@ -43,6 +43,14 @@ import org.apache.lucene.util.ArrayUtil;
* flag in {@link CJKBigramFilter#CJKBigramFilter(TokenStream, int, boolean)}.
* This can be used for a combined unigram+bigram approach.
* <p>
+ * Unlike ICUTokenizer, StandardTokenizer does not split at script boundaries.
+ * Korean Hangul characters are treated the same as many other scripts'
+ * letters, and as a result, StandardTokenizer can produce tokens that mix
+ * Hangul and non-Hangul characters, e.g. "한국abc". Such mixed-script tokens
+ * are typed as <code><ALPHANUM></code> rather than
+ * <code><HANGUL></code>, and as a result, will not be converted to
+ * bigrams by CJKBigramFilter.
+ *
* In all cases, all non-CJK input is passed thru unmodified.
*/
public final class CJKBigramFilter extends TokenFilter {
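
The javadoc added above says that StandardTokenizer treats Hangul syllables like any other script's letters, so a mixed token such as "한국abc" is kept whole and typed ALPHANUM instead of HANGUL. A small stdlib-only sketch (not part of the commit, and not using the Lucene API itself) of why that happens: both Hangul syllables and Latin letters are Unicode letters, so a tokenizer that groups consecutive letters without consulting each code point's UnicodeScript will not break at the script boundary, whereas a script-aware tokenizer such as ICUTokenizer would split there.

```java
import java.lang.Character.UnicodeScript;

public class HangulScriptDemo {
    public static void main(String[] args) {
        // The mixed-script example from the javadoc: Hangul followed by Latin.
        String token = "한국abc";

        // Every code point is a Unicode letter, so a tokenizer that simply
        // groups runs of letters keeps the whole string as one token.
        boolean allLetters = token.codePoints().allMatch(Character::isLetter);
        System.out.println("all letters: " + allLetters); // true

        // The scripts differ, though: HANGUL then LATIN. A tokenizer that
        // checks UnicodeScript per code point would split at that boundary.
        token.codePoints()
             .mapToObj(UnicodeScript::of)
             .distinct()
             .forEach(System.out::println); // HANGUL, then LATIN
    }
}
```

This is why the filter never sees a HANGUL-typed token for such input: the type decision is made upstream by the tokenizer, and CJKBigramFilter only bigrams tokens whose type matches the CJK types it is configured for.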