Posted to dev@lucene.apache.org by ot...@apache.org on 2004/03/02 14:56:03 UTC
cvs commit: jakarta-lucene-sandbox/contributions/analyzers/src/java/org/apache/lucene/analysis/cn ChineseTokenizer.java
otis 2004/03/02 05:56:03
Modified: contributions/analyzers/src/java/org/apache/lucene/analysis/cn
ChineseTokenizer.java
Log:
- Added documentation
Revision Changes Path
1.4 +18 -1 jakarta-lucene-sandbox/contributions/analyzers/src/java/org/apache/lucene/analysis/cn/ChineseTokenizer.java
Index: ChineseTokenizer.java
===================================================================
RCS file: /home/cvs/jakarta-lucene-sandbox/contributions/analyzers/src/java/org/apache/lucene/analysis/cn/ChineseTokenizer.java,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -r1.3 -r1.4
--- ChineseTokenizer.java 22 Jan 2004 20:54:47 -0000 1.3
+++ ChineseTokenizer.java 2 Mar 2004 13:56:03 -0000 1.4
@@ -64,6 +64,23 @@
* Rule: A Chinese character as a single token
* Copyright: Copyright (c) 2001
* Company:
+ *
+ * The difference between the ChineseTokenizer and the
+ * CJKTokenizer (id=23545) is that they tokenize text
+ * differently.
+ *
+ * For example, if the Chinese text "C1C2C3C4" is to be
+ * indexed: the tokens returned from the ChineseTokenizer
+ * are C1, C2, C3, C4. The tokens returned from the
+ * CJKTokenizer are C1C2, C2C3, C3C4.
+ *
+ * Therefore the index created by the CJKTokenizer is
+ * much larger.
+ *
+ * The problem is that when searching for C1, C1C2, C1C3,
+ * C4C2, C1C2C3 ... the ChineseTokenizer works, but the
+ * CJKTokenizer will not.
+ *
* @author Yiyi Sun
* @version 1.0
*
@@ -149,4 +166,4 @@
}
}
-}
\ No newline at end of file
+}
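The unigram-versus-overlapping-bigram behaviour described in the new Javadoc can be sketched as follows. This is an illustrative sketch only, not the actual ChineseTokenizer or CJKTokenizer implementations; the class and method names are hypothetical, and each Chinese character is modelled as a string element so the output matches the C1..C4 notation used above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenizerComparison {

    // Unigram style (like ChineseTokenizer): one token per character.
    static List<String> unigrams(String[] chars) {
        return new ArrayList<>(Arrays.asList(chars));
    }

    // Overlapping bigram style (like CJKTokenizer): each token is a
    // pair of adjacent characters, so n characters yield n-1 tokens.
    static List<String> bigrams(String[] chars) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < chars.length; i++) {
            tokens.add(chars[i] + chars[i + 1]);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] text = {"C1", "C2", "C3", "C4"};
        System.out.println(unigrams(text)); // [C1, C2, C3, C4]
        System.out.println(bigrams(text));  // [C1C2, C2C3, C3C4]
    }
}
```

The sketch also shows why a single-character query such as C1 can miss in a bigram index: C1 alone never appears as a token there, only pairs like C1C2 do.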