Posted to java-user@lucene.apache.org by Bill Janssen <ja...@parc.com> on 2005/08/28 03:56:16 UTC

Re: Lucene does NOT use UTF-8.

Thanks for pointing this out, Marvin.  I wish Sun (or someone) would
document and register this particular character set encoding with
IANA, so that it could be used outside of Java.  As it stands now,
it's essentially a bastard encoding, good for nothing, and one of the
warts of Java.

Lucene probably shouldn't be using it in its file formats.
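To make the difference concrete, here is a minimal sketch comparing the JDK's "modified UTF-8" with standard UTF-8. It assumes the encoding under discussion is the one documented for java.io.DataOutputStream.writeUTF; it does not call Lucene's own writer code, and the class and method names (ModifiedUtf8Demo, modifiedUtf8, standardUtf8, hex) are made up for illustration. The two encodings differ for U+0000 and for characters outside the Basic Multilingual Plane.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {

    // Encode with the JDK's modified UTF-8 via DataOutputStream.writeUTF,
    // then drop the 2-byte length prefix that writeUTF puts before the data.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Encode with standard UTF-8.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Render bytes as space-separated uppercase hex, e.g. "C0 80".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = String.valueOf((char) 0);                    // U+0000
        String clef = new String(Character.toChars(0x1D11E));     // U+1D11E, outside the BMP

        System.out.println("U+0000  standard: " + hex(standardUtf8(nul)));   // 00
        System.out.println("U+0000  modified: " + hex(modifiedUtf8(nul)));   // C0 80
        System.out.println("U+1D11E standard: " + hex(standardUtf8(clef)));  // F0 9D 84 9E
        System.out.println("U+1D11E modified: " + hex(modifiedUtf8(clef)));  // ED A0 B4 ED B4 9E
    }
}
```

A strict UTF-8 decoder outside of Java should reject both modified forms: C0 80 is an overlong encoding, and ED A0 B4 ED B4 9E is a surrogate pair encoded as two three-byte sequences.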

Bill

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene does NOT use UTF-8.

Posted by Dave Kor <s0...@sms.ed.ac.uk>.
http://java.sun.com/docs/books/tutorial/i18n/text/stream.html

Yes, it's confusing. Sun calls its own encoding format "Unicode", and the above
webpage talks about how to convert between Java's Unicode format and the UTF-8
format.

It's just a matter of specifying "UTF-8" when creating output streams. I may be
misremembering, but I seem to recall that old Lucene code designed to run on
JDK 1.1 actually did this "UTF-8" conversion.
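As a rough sketch of what specifying "UTF-8" on an output stream looks like, the following wraps a stream in an OutputStreamWriter with an explicit charset name. The class and method names (Utf8WriterDemo, writeAsUtf8) are hypothetical, and this is not taken from any Lucene release; it only shows the standard JDK call.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class Utf8WriterDemo {

    // Write text through a Writer that converts Java's in-memory
    // characters to standard UTF-8 bytes on the way out.
    static byte[] writeAsUtf8(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(buf, "UTF-8")) {
            w.write(text);
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException from the constructor.
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        String eAcute = String.valueOf((char) 0xE9);              // U+00E9
        String clef = new String(Character.toChars(0x1D11E));     // U+1D11E, outside the BMP

        System.out.println(writeAsUtf8(eAcute).length); // 2 bytes: C3 A9
        System.out.println(writeAsUtf8(clef).length);   // 4 bytes: F0 9D 84 9E
    }
}
```

Unlike the modified encoding, this path emits a single four-byte sequence for a character outside the BMP, which is what non-Java readers of the file would expect.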
