Posted to dev@ofbiz.apache.org by Adam Heath <do...@brainfood.com> on 2010/06/17 18:22:12 UTC

UTF-8 encoding and BOM java bug

The Unicode spec says that a file 'may' start with a BOM (U+FEFF).  The
reader of the bytes can then look at how the BOM is encoded, and pick
the correct encoding (UTF-8, UTF-16 (LE/BE), or UTF-32 (LE/BE)).  If the
file does start with a BOM, the reader must strip it from the decoded
text.

A BOM anywhere else in the datastream is left alone.
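
To make the detection step concrete, here is a minimal sketch of
sniffing the encoding from the leading bytes.  The class and method
names (BomSniffer, detect) are made up for illustration; only the byte
patterns come from the Unicode spec.

```java
final class BomSniffer {
    /**
     * Returns the encoding implied by a leading BOM, or null if the
     * data does not start with one. (Hypothetical helper, not a JDK API.)
     */
    static String detect(byte[] b) {
        // UTF-32 must be checked before UTF-16: the UTF-32LE BOM
        // (FF FE 00 00) begins with the UTF-16LE BOM (FF FE). This
        // means UTF-16LE text starting with BOM + U+0000 is misread
        // as UTF-32LE, an ambiguity inherent to BOM sniffing.
        if (startsWith(b, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(b, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(b, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(b, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // True if b begins with the given unsigned byte values.
    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A caller would read the first four bytes, call detect(), and then skip
the BOM bytes (3 for UTF-8, 2 for UTF-16, 4 for UTF-32) before
decoding the rest.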

However, lovely Java doesn't do this correctly.  Its UTF-8 decoder does
*not* remove the BOM.  Only the decoders for the other encodings do.
The bug about this is at
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058

I'm sending this to the list, because UTF-8 is the only sensible
encoding to use nowadays, and this might crop up here.  I don't really
have a fix yet.

I'm going to have to deal with this in webslinger, so I'll develop a
change there, and then alter the ofbiz code with the same kind of logic.
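
A minimal sketch of what such logic could look like, assuming the
input is already known to be UTF-8.  This is not the actual webslinger
or ofbiz change; the names Utf8BomReader, decode and open are
hypothetical.  It relies only on the fact that Java's UTF-8 decoder
passes a leading BOM through as the character U+FEFF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class Utf8BomReader {
    private static final char BOM = '\uFEFF';

    /** Decodes UTF-8 bytes, dropping a single leading BOM if present. */
    static String decode(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        // Only the first character is checked; a U+FEFF anywhere
        // else in the data is left alone.
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    /** Wraps a UTF-8 stream in a Reader that skips a leading BOM. */
    static Reader open(InputStream in) throws IOException {
        PushbackReader r = new PushbackReader(
                new InputStreamReader(in, StandardCharsets.UTF_8), 1);
        int first = r.read();
        // Push back anything that is not a BOM (and not end of stream).
        if (first != -1 && first != BOM) {
            r.unread(first);
        }
        return r;
    }
}
```

The Reader variant peeks at one decoded character rather than three
raw bytes, so it keeps working if the stream is later wrapped or
buffered differently.  It does not attempt to detect other encodings.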