Posted to java-user@lucene.apache.org by Jon Schuster <jo...@wrq.com> on 2004/07/02 22:49:19 UTC

Problems indexing Japanese with CJKAnalyzer

Hi,

I've gone through all of the past messages regarding the CJKAnalyzer but I
still must be doing something wrong because my searches don't work.

I'm using the IndexHTML application from the org.apache.lucene.demo package
to do the indexing, and I've changed the analyzer to use the CJKAnalyzer.
I've also tried with and without setting the file.encoding to Shift-JIS.
I've tried indexing the HTML files, which contain Shift-JIS, without
conversion to Unicode and I get assorted "Parse Aborted: Lexical error..."
messages. I've also tried converting the Shift-JIS HTML files to Unicode by
first running them through the native2ascii tool.

When the files are converted via native2ascii, they index without errors,
but the index appears to contain the Unicode characters as literal strings
such as "u7aef", "u7af6", etc. Searching for an English word produces
results that have text like "code \u5c5e\u6027".

Since others have gotten Japanese indexing to work, what's the secret I'm
missing?

Thanks,
Jon


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Problems indexing Japanese with CJKAnalyzer

Posted by Steven Rowe <sa...@syr.edu>.

Hi Jon,

It sounds to me like you have a character encoding problem.  The
native2ascii tool is designed to produce input for the Java compiler;
the "\u7aef" notation you're seeing is an escape that the compiler (and
a few other tools, such as the properties file loader) reads as the
Unicode character with that hexadecimal code point.  Other Java
programs, however, depending on their implementation, may not
understand this notation and will treat it as literal text, which would
explain the "u7aef" terms in your index.  Alternatively, maybe the
notation is understood, but the conversion from Shift-JIS to Java
Unicode format is not being performed properly.  If you don't tell
native2ascii the source encoding, it will assume the "native" encoding
for the platform; on Windows, depending on which localized version
you've got, this is likely to be the so-called code page 1252
(ISO-8859-1 with a few modifications).  Converting from one character
encoding to another with incorrect assumptions about the source
encoding can only lead to sorrow and confusion.
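
Both failure modes can be reproduced in a few lines of Java.  This is
only a sketch: the class and method names are made up for illustration,
and it assumes a JDK that ships the Shift_JIS and windows-1252 charsets
(standard JDKs do).  It uses the two characters from the "code" search
result quoted in the original message.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingPitfalls {
    // Decode raw bytes with the named charset, much as a Reader would.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Pitfall 1: escape text read at run time is not unescaped.
        // The doubled backslash yields six plain ASCII characters,
        // which is what a native2ascii output file holds on disk.
        byte[] escapedOnDisk = "\\u5c5e".getBytes(StandardCharsets.US_ASCII);
        String readBack = decode(escapedOnDisk, "US-ASCII");
        System.out.println("escaped text length: " + readBack.length());

        // Pitfall 2: decoding Shift-JIS bytes with the wrong charset.
        String original = "\u5c5e\u6027"; // two kanji, unescaped by the compiler
        byte[] sjis = original.getBytes(Charset.forName("Shift_JIS"));
        String right = decode(sjis, "Shift_JIS");
        String wrong = decode(sjis, "windows-1252");
        System.out.println("right charset round-trips: " + right.equals(original));
        System.out.println("wrong charset round-trips: " + wrong.equals(original));
        System.out.println("wrong charset length: " + wrong.length());
    }
}
```

The first case is a single kanji that arrives as six characters; the
second is four bytes that code page 1252 turns into four unrelated
characters instead of two kanji.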

I think you can use the native2ascii tool to do what you want 
(untested), but it will take two passes:

1. Use native2ascii to convert your file(s) to Java Unicode format, but 
tell it the source encoding:

    native2ascii -encoding SJIS inputfile outputfile1

2. Tell it to convert from Java Unicode format to UTF-8:

    native2ascii -reverse -encoding UTF8 outputfile1 finaloutput
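
If running the tool twice is inconvenient, the same conversion could be
sketched in Java as a single pass.  This is an illustration rather than
a tested recipe: the class name SjisToUtf8 and its convert method are
hypothetical, it assumes the input really is Shift-JIS, and it uses
try-with-resources, so it needs a newer JDK than the 1.4.2 tools linked
below.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SjisToUtf8 {
    // Read 'in' as Shift-JIS and write the same characters to 'out' as UTF-8.
    static void convert(File in, File out) throws IOException {
        try (Reader reader = new BufferedReader(new InputStreamReader(
                 new FileInputStream(in), Charset.forName("Shift_JIS")));
             Writer writer = new BufferedWriter(new OutputStreamWriter(
                 new FileOutputStream(out), StandardCharsets.UTF_8))) {
            char[] buffer = new char[8192];
            int count;
            // Copy decoded characters until end of input.
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java SjisToUtf8 inputfile outputfile");
            return;
        }
        convert(new File(args[0]), new File(args[1]));
    }
}
```

The point in both approaches is the same: name the source encoding
explicitly rather than relying on the platform default.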

Here's a web page with more information on native2ascii:

<URL:http://java.sun.com/j2se/1.4.2/docs/tooldocs/windows/native2ascii.html>

Hope it helps,
Steve Rowe
