Posted to dev@lucene.apache.org by bu...@apache.org on 2003/04/11 05:50:36 UTC
DO NOT REPLY [Bug 18933] New: -
Add support for Chinese, Japanese, and Korean to the core build.
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18933
Summary: Add support for Chinese, Japanese, and Korean to the
core build.
Product: Lucene
Version: unspecified
Platform: Other
OS/Version: Other
Status: NEW
Severity: Enhancement
Priority: Other
Component: Analysis
AssignedTo: lucene-dev@jakarta.apache.org
ReportedBy: Eric.Isakson@sas.com
Moved from todo.xml:
Che Dong's CJKTokenizer for Chinese, Japanese, and Korean.
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=330905
and his sigram patch to StandardTokenizer.jj
http://nagoya.apache.org/eyebrowse/SearchList?listId=&listName=lucene-dev@jakarta.apache.org&searchText=sigram&defaultField=subject&Search=Search
I know there was some discussion a while back about keeping language-variant
analyzers out of the core, but the sigram change to StandardTokenizer would make
the StandardAnalyzer usable for Asian languages. From what I understand about
searching in Asian languages, the bigram approach used in CJKTokenizer will give
better results.
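
A minimal sketch of the two tokenization styles, assuming "sigram" means
one token per character and "bigram" means overlapping two-character tokens.
The class and method names are hypothetical; this is not the code from
CJKTokenizer or the StandardTokenizer.jj patch, and it treats each Java char
as one character (no surrogate-pair handling).

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical helper, not Lucene's CJKTokenizer or StandardTokenizer.
class CjkTokenSketch {

    // Single-character tokens: every character becomes its own token.
    static List<String> singleChars(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // Overlapping bigrams: each adjacent pair of characters becomes a token.
    // A one-character input is kept as a single token so it is not lost.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text);
            return tokens;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "ABCDEFG"; // stands in for a run of Chinese characters
        System.out.println(singleChars(text)); // [A, B, C, D, E, F, G]
        System.out.println(bigrams(text));     // [AB, BC, CD, DE, EF, FG]
    }
}
```

The bigram tokens keep some information about character order inside each
token, whereas single-character tokens rely entirely on position matching at
query time.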
I'm not sure how this change affects the QueryParser, or how (or whether) either
of these approaches makes sense alongside some of the query syntax. For instance,
if I had the string of Chinese characters ABCDEFG (notice the lack of spaces
between words in this language) and the actual words are AB, CDE and FG, how
would a Chinese user expect to enter a query that we would write in English as
AB +CDE FG?
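
One possible reading of the question, sketched under assumptions rather than
taken from the QueryParser: the user still separates query words with spaces,
each query word is split into overlapping bigrams, and a word matches when its
bigrams appear as a consecutive run in the document's bigrams. All names here
are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: hypothetical matching scheme, not Lucene's QueryParser behavior.
class CjkQuerySketch {

    // Overlapping bigrams; a one-character input is kept as a single token.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text);
            return tokens;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // True if the term's bigrams occur as a consecutive run in the document's bigrams.
    static boolean matches(String document, String term) {
        return Collections.indexOfSubList(bigrams(document), bigrams(term)) >= 0;
    }

    public static void main(String[] args) {
        String document = "ABCDEFG"; // unsegmented text; real words are AB, CDE, FG
        for (String clause : "AB +CDE FG".split("\\s+")) {
            boolean required = clause.startsWith("+"); // '+' marks a required clause
            String term = required ? clause.substring(1) : clause;
            System.out.println(term + " required=" + required
                    + " bigrams=" + bigrams(term)
                    + " matches=" + matches(document, term));
        }
        // "BC" straddles two real words but still matches under this scheme.
        System.out.println("BC matches=" + matches(document, "BC"));
    }
}
```

Under this scheme a query word that is not a real word, such as BC, also
matches, which is one of the trade-offs that would need input from someone who
knows how Chinese users actually search.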
I wish I spoke one of these languages so I would have a better understanding of
the search issues. Perhaps Che Dong can give us an idea of whether, and how, a
Chinese user would expect to interact with the query syntax. I know I've been
struggling with that problem in my own app, which needs to support Chinese and
Japanese content.