
Posted to java-user@lucene.apache.org by saisantoshi <sa...@gmail.com> on 2013/01/08 19:20:00 UTC

Lucene support for multi-byte characters (version 2.4.0)

We are using Lucene (the 2.4.0 libraries) to implement search in our
application, with StandardAnalyzer as the analyzer.

Our application has a document upload feature that lets you upload
documents and attach keywords to them at upload time. When we search by
keyword, the search should return the documents that were tagged with
those keywords.

The problem we are facing is that the search works fine when the keywords
are in English or Simplified Chinese, but it does not work for Japanese
keywords.

I am not sure whether the problem is with the analyzer we are using or
whether Japanese characters are simply not supported in version 2.4.0. I
found the following links through a Google search:

https://issues.apache.org/jira/browse/LUCENE-2847 (support all of
Unicode)

http://lucene.472066.n3.nabble.com/which-unicode-version-is-supported-with-lucene-td2574222.html

We are not tokenizing the document itself; we are only tokenizing the
keywords added while uploading the document:

document.add(new Field(field.getKeyword(), value, Field.Store.NO,
Field.Index.ANALYZED));
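
Before looking at the analyzer, it may be worth checking that the keyword
value passed to the field above still contains Japanese characters and was
not damaged by a charset conversion on the way in. The following is only a
minimal sketch using the standard Java library, not Lucene. The class name
KeywordBlocks, the method blocksOf and the sample strings are hypothetical
and are not part of our application. It lists the Unicode blocks found in a
keyword; if a Japanese keyword reports only BASIC_LATIN or
LATIN_1_SUPPLEMENT, the text was probably altered before indexing.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class KeywordBlocks {

    // Returns the distinct Unicode block names of the code points in the
    // keyword, in the order they are first seen.
    static Set<String> blocksOf(String keyword) {
        Set<String> blocks = new LinkedHashSet<String>();
        for (int i = 0; i < keyword.length(); ) {
            int cp = keyword.codePointAt(i);
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            // of() returns null for code points that belong to no block.
            blocks.add(block == null ? "UNASSIGNED" : block.toString());
            // Advance by one or two chars so surrogate pairs are not split.
            i += Character.charCount(cp);
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Hypothetical sample keywords, written as escapes so the result
        // does not depend on the encoding of this source file.
        String[] samples = {
            "search",                       // English
            "\u691C\u7D22",                 // kanji
            "\u3051\u3093\u3055\u304F",     // hiragana
            "\u30B1\u30F3\u30B5\u30AF"      // katakana
        };
        for (String sample : samples) {
            System.out.println(blocksOf(sample));
        }
    }
}
```

Calling blocksOf(value) just before document.add(...) and logging the
result would show whether the Japanese text reaches the indexing code
intact. This only rules the input encoding in or out; it says nothing
about how StandardAnalyzer tokenizes the text.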

Do you think upgrading to the latest version of Lucene would solve the
issue, or do we need to use a special analyzer for each language? Does
StandardAnalyzer not support Unicode characters?

Any thoughts on this are much appreciated.

Thanks,
Sai

--
View this message in context: http://lucene.472066.n3.nabble.com/Lucene-support-for-multi-byte-characters-2-4-0-version-tp4031654.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
