Posted to dev@lucene.apache.org by bu...@apache.org on 2003/04/11 05:50:36 UTC

DO NOT REPLY [Bug 18933] New: - Add support for Chinese, Japanese, and Korean to the core build.

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18933>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18933

Add support for Chinese, Japanese, and Korean to the core build.

           Summary: Add support for Chinese, Japanese, and Korean to the
                    core build.
           Product: Lucene
           Version: unspecified
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Enhancement
          Priority: Other
         Component: Analysis
        AssignedTo: lucene-dev@jakarta.apache.org
        ReportedBy: Eric.Isakson@sas.com


Moved from todo.xml:

Che Dong's CJKTokenizer for Chinese, Japanese, and Korean.
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=330905

and his sigram patch to StandardTokenizer.jj
http://nagoya.apache.org/eyebrowse/SearchList?listId=&listName=lucene-dev@jakarta.apache.org&searchText=sigram&defaultField=subject&Search=Search

I know there was some discussion a while back about keeping language-variant 
analyzers out of the core, but the sigram change to StandardTokenizer would 
make the StandardAnalyzer usable for Asian languages. From what I understand 
about searching in Asian languages, the bigram approach used in CJKTokenizer 
will give better results.

I'm not sure of the impact of this change on the QueryParser, or how/if 
either of these approaches makes sense alongside some of the query syntax. 
For instance, if I had the string of Chinese characters ABCDEFG (notice the 
lack of spaces between words in this language) and the actual words are AB, 
CDE, and FG, how would a Chinese user expect to enter a query that we would 
write in English as AB +CDE FG?
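The segmentation question above can be sketched in code. This is only a 
hedged illustration of overlapping bigram tokenization, not Che Dong's 
actual CJKTokenizer or the sigram patch; the class and helper names are 
hypothetical, and Latin letters stand in for the Chinese characters ABCDEFG:

```java
// Minimal sketch of index-time bigram segmentation, plus how individual
// query words would map onto bigram tokens. Not the real CJKTokenizer.
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Emit overlapping two-character tokens: "ABCDEFG" -> AB BC CD DE EF FG.
    // A single-character input is kept as-is so short words stay searchable.
    static List<String> bigrams(String run) {
        if (run.length() < 2) return List.of(run);
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < run.length(); i++) {
            tokens.add(run.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Index time: the whole unsegmented run is bigrammed.
        System.out.println(bigrams("ABCDEFG")); // [AB, BC, CD, DE, EF, FG]
        // Query time: the word AB is a single token, but CDE spans two
        // bigrams, so "+CDE" would presumably need to become a required
        // phrase query on "CD DE" for matching to line up.
        System.out.println(bigrams("CDE"));     // [CD, DE]
    }
}
```

The point of the sketch is that under bigram indexing a query word longer 
than two characters no longer corresponds to one term, which is exactly 
where the QueryParser question arises.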

I wish I spoke one of these languages so I would have a better understanding 
of the search issues. Perhaps Che Dong can give us an idea of whether and how 
a Chinese user would expect to interact with the query syntax. I know I've 
been struggling with that problem in my own app, which needs to support 
Chinese and Japanese content.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org