Posted to java-user@lucene.apache.org by Jon Schuster <jo...@wrq.com> on 2004/07/02 22:49:19 UTC
Problems indexing Japanese with CJKAnalyzer
Hi,
I've gone through all of the past messages regarding the CJKAnalyzer but I
still must be doing something wrong because my searches don't work.
I'm using the IndexHTML application from the org.apache.lucene.demo package
to do the indexing, and I've changed the analyzer to use the CJKAnalyzer.
I've also tried with and without setting the file.encoding to Shift-JIS.
I've tried indexing the HTML files, which contain Shift-JIS, without
conversion to Unicode and I get assorted "Parse Aborted: Lexical error..."
messages. I've also tried converting the Shift-JIS HTML files to Unicode by
first running them through the native2ascii tool.
When the files are converted via native2ascii, they index without errors,
but the index appears to contain the Unicode characters as literal strings
such as "u7aef", "u7af6", etc. Searching for an English word produces
results that have text like "code \u5c5e\u6027".
Since others have gotten Japanese indexing to work, what's the secret I'm
missing?
Thanks,
Jon
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: Problems indexing Japanese with CJKAnalyzer
Posted by Steven Rowe <sa...@syr.edu>.
Hi Jon,
It sounds to me like you have a character encoding problem. The
native2ascii tool is designed to produce input for the Java compiler;
the "\u7aef" notation you're seeing is a Java Unicode escape, which the
compiler reads as the character with that hexadecimal code point.
Other Java programs, however, depending on their implementation, may not
understand this notation, in which case the escapes are treated as
literal text. Alternatively, maybe the notation is understood, but the
conversion from Shift-JIS to Java Unicode format is not being performed
properly: if you don't tell native2ascii the source encoding, it will
assume the "native" encoding for the platform. On Windows, depending on
which localized version you've got, this is likely to be code page 1252
(essentially ISO-8859-1 with extra printable characters in the
0x80-0x9F range). Converting from one character encoding to another
with incorrect assumptions about the source encoding can only lead to
sorrow and confusion.
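As a small illustration of that mismatch, here is a sketch in Java. It
is not from the original exchange: the class and method names are made
up for the example, and it assumes a Java runtime that ships the
Shift_JIS and windows-1252 charsets. It decodes the same Shift-JIS bytes
once with the right charset and once with the wrong one:

```java
import java.nio.charset.Charset;

public class EncodingMismatchDemo {
    // Hypothetical helper: decode raw bytes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The compiler turns these escapes into two kanji characters.
        String original = "\u7aef\u7af6";
        // Simulate the contents of a Shift-JIS encoded file.
        byte[] sjis = original.getBytes(Charset.forName("Shift_JIS"));

        String right = decode(sjis, "Shift_JIS");
        String wrong = decode(sjis, "windows-1252");

        System.out.println("right charset round-trips: " + right.equals(original));
        System.out.println("wrong charset round-trips: " + wrong.equals(original));
    }
}
```

Only the decode that names the real source encoding gives back the
original text; the other one yields unrelated characters, which is what
an indexer would then see.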
I think you can use the native2ascii tool to do what you want
(untested), but it will take two passes:
1. Use native2ascii to convert your file(s) to Java Unicode format, but
   tell it the source encoding:

       native2ascii -encoding SJIS inputfile outputfile1

2. Tell it to convert from Java Unicode format to UTF-8:

       native2ascii -reverse -encoding UTF8 outputfile1 finaloutput
Here's a web page with more information on native2ascii:
<URL:http://java.sun.com/j2se/1.4.2/docs/tooldocs/windows/native2ascii.html>
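The same Shift-JIS to UTF-8 conversion could also be done in one step
from Java, without the intermediate escaped file. The following is only
a sketch under assumptions: the class name, the transcode helper and the
command-line arguments are invented for this example, and the runtime
must provide the Shift_JIS charset. It reports malformed input instead
of silently substituting characters:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SjisToUtf8 {
    // Hypothetical helper: re-encode bytes from one charset to another,
    // failing loudly if the input is not valid in the source charset.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        try {
            CharBuffer chars = from.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            ByteBuffer encoded = to.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(chars);
            byte[] result = new byte[encoded.remaining()];
            encoded.get(result);
            return result;
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Usage (assumed): java SjisToUtf8 inputfile finaloutput
    public static void main(String[] args) throws IOException {
        byte[] sjis = Files.readAllBytes(Paths.get(args[0]));
        byte[] utf8 = transcode(sjis, Charset.forName("Shift_JIS"),
                StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), utf8);
    }
}
```

Whichever route you take, whatever reads the converted files afterwards
still has to be told they are UTF-8.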
Hope it helps,
Steve Rowe