You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by "Faulkner, Jeffrey" <je...@IDEXX.com> on 2005/07/01 16:10:50 UTC

UTF-8 indexing and searching

I'm trying to index and search html and jsp files that are saved using utf-8
encoding.  The pages are indexed on the file system using the
StandardAnalyzer.  The files can contain a mix of english, chinese,
japanese, etc. saved as utf-8.  Searches using english terms are successful
but none of the searches for the chinese or japanese characters are
successful.  I'm a little confused about dealing with different encodings to
begin with. I tried specifying the encoding as utf-8 when reading in the
files using an InputStream and also also converting the String retrieved
from a form submission using getBytes specifying utf-8 also without any
success.  I've heard the StandardAnalzyer isn't optimal at indexing these
other characters but I heard that it does provide some basic indexing
capability.  Is it possible that these characters are being indexed and
searched correctly but they just aren't being found because of the
limitations of the analyzer or do you think that the files are being indexed
or searched incorrectly due to some encoding issues?
Thanks for the input 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org