Posted to solr-user@lucene.apache.org by Gereon Steffens <ge...@steffens.org> on 2007/05/02 09:58:47 UTC

UTF-8 2-byte vs 4-byte encodings

Hi,

I have a question regarding UTF-8 encodings, illustrated by the
utf8-example.xml file. This file contains raw, unescaped UTF-8 characters,
for example the "e acute" character, represented as two bytes 0xC3 0xA9.
When this file is added to Solr and retrieved later, the XML output
contains a four-byte representation of that character, namely 0xC2 0x83
0xC2 0xA9.

If, on the other hand, the input data contains this same character as an
entity &#xE9; the output contains the two-byte encoded representation 0xC3
0xA9.

Why is that so, and is there a way to always get characters like these out
of Solr as their two-byte representations?

The reason I'm asking is that I often have to deal with CDATA sections in
my input files that contain raw (two-byte) UTF-8 characters that can't be
encoded as entities.

Thanks,
Gereon


Re: AW: UTF-8 2-byte vs 4-byte encodings

Posted by Gereon Steffens <ge...@steffens.org>.
Hi Christian,

> It is not sufficient to set the encoding in the XML but
> you need an additional HTTP header to set the encoding ("Content-type:
> text/xml; charset=UTF-8")
Thanks, that's what I was missing.

Gereon
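
For illustration, a minimal Python sketch of an update request that carries the charset in the HTTP header as well as in the body. The endpoint URL and the field names "id" and "name" are assumptions made for this example, not taken from the thread; adjust them to the actual installation. The request is only built, not sent.

```python
import urllib.request

# Hypothetical endpoint: adjust host, port and path to the actual installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_text: str, url: str = SOLR_UPDATE_URL) -> urllib.request.Request:
    """Build a POST request whose body and Content-Type header both say UTF-8."""
    body = xml_text.encode("utf-8")  # raw UTF-8 bytes, no entities needed
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml; charset=UTF-8"},
        method="POST",
    )


# Hypothetical document; the field names are placeholders.
doc = '<add><doc><field name="id">1</field><field name="name">caf\u00e9</field></doc></add>'
request = build_update_request(doc)

# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

The point of the sketch is only that the charset appears in the header; any HTTP client that lets you set Content-Type should work the same way.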


AW: UTF-8 2-byte vs 4-byte encodings

Posted by "Burkamp, Christian" <C....@Ceyoniq.com>.
Gereon,

The four bytes do not look like a valid UTF-8 encoded character. 4-byte characters in UTF-8 start with the binary sequence "11110...". (For reference, see the excellent Wikipedia article on UTF-8 encoding.)
Your problem looks like someone interpreted your valid 2-byte UTF-8 encoded character as two single-byte characters in some other encoding and then encoded each of them as UTF-8 again. This happens if you send XML updates to Solr via HTTP without setting the encoding properly. It is not sufficient to set the encoding in the XML; you also need an HTTP header that sets the encoding ("Content-type: text/xml; charset=UTF-8").
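
The effect can be reproduced in a few lines of Python. This is only a sketch, and it assumes the receiving side fell back to ISO-8859-1, which the thread does not confirm. Under that assumption the result is 0xC3 0x83 0xC2 0xA9, which is close to, though not byte-for-byte identical with, the sequence quoted in the question. The helper utf8_sequence_length is a name made up for this example.

```python
E_ACUTE = "\u00e9"  # the "e acute" character

raw = E_ACUTE.encode("utf-8")  # the two bytes the client sends: C3 A9

# Assumption: no charset was declared, so the receiver decodes as ISO-8859-1.
misread = raw.decode("iso-8859-1")        # two characters: U+00C3 and U+00A9
double_encoded = misread.encode("utf-8")  # each becomes its own 2-byte sequence


def utf8_sequence_length(lead_byte: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte, or 0 if it is not one."""
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if lead_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if lead_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if lead_byte >> 3 == 0b11110:
        return 4  # 11110xxx: the only pattern that starts a 4-byte character
    return 0      # 10xxxxxx continuation byte, or invalid


print(double_encoded.hex(" "))
print(utf8_sequence_length(double_encoded[0]))
```

The first byte announces a 2-byte sequence, so the four bytes are two separate 2-byte characters rather than one 4-byte character.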

--Christian
