Posted to solr-user@lucene.apache.org by HUYLEBROECK Jeremy RD-ILAB-SSF <je...@orange-ftgroup.com> on 2007/04/28 00:21:49 UTC

Unicode characters

Hi,

We're experiencing some encoding problems with Unicode characters coming
out of Solr.
Let me explain our flow:

-fetch a web page
-decode entities and unicode characters (such as &#149;) using the Neko
library
-get a unicode String in Java
-send it to Solr through XML created by SAX, with the right encoding
(UTF-8) specified everywhere (writer, header, etc.)
-it apparently arrives clean on the Solr side (verified in our logs).
-In the query output from Solr (XML message), the character is not
encoded as an entity (not &#149;) but the character itself is used
(character 149 = 0x95 hexadecimal).

And we can see in Firefox and in our logs a "code" instead of the
character (code 149, or 0x95), even though the original XML message sent
to Solr was rendered properly in Firefox, the shell, etc.

We might have missed something somewhere, as we easily get lost in the
whole encoding/Unicode nightmare ;)

We've seen the "escape" method in Solr's XML class. It escapes only a
few codes as entities. Could it be the source of our problem?
What would be the right approach to encode our input properly without
having to tweak the Solr code? Or is it a bug?
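
A check like the following on the Java String right before it is sent
would show exactly which code points we are dealing with (the sample
value here is made up for illustration):

    public class DumpCodePoints {
        public static void main(String[] args) {
            // Hypothetical value coming out of the Neko step.
            String text = "caf\u00e9 \u0095";
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                System.out.printf("U+%04X%s%n", (int) c,
                        Character.isISOControl(c) ? "  <-- control character" : "");
            }
        }
    }

If a control character like U+0095 shows up here, the problem is
upstream of Solr.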

Thanks

Jeremy.


Re: AW: UTF-8 2-byte vs 4-byte encodings

Posted by Gereon Steffens <ge...@steffens.org>.
Hi Christian,

> It is not sufficient to set the encoding in the XML declaration; you
> also need an HTTP header that sets the encoding ("Content-type:
> text/xml; charset=UTF-8")
Thanks, that's what I was missing.

Gereon


AW: UTF-8 2-byte vs 4-byte encodings

Posted by "Burkamp, Christian" <C....@Ceyoniq.com>.
Gereon,

The four bytes do not look like a valid UTF-8 encoded character: 4-byte
characters in UTF-8 start with the binary sequence "11110..." (for
reference, see the excellent Wikipedia article on UTF-8 encoding).
Your problem looks like someone interpreted your valid 2-byte UTF-8
encoded character as two single-byte characters in some fancy encoding.
This happens if you send XML updates to Solr via HTTP without setting
the encoding properly. It is not sufficient to set the encoding in the
XML declaration; you also need an HTTP header that sets the encoding
("Content-type: text/xml; charset=UTF-8")
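
A sketch of what that looks like with plain HttpURLConnection; the Solr
URL, field names, and document content below are only placeholders:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class Utf8Update {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:8983/solr/update");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            // The header that is easy to forget -- without it the server
            // may fall back to a single-byte default encoding:
            conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            conn.setDoOutput(true);
            String xml = "<add><doc><field name=\"id\">1</field>"
                       + "<field name=\"title\">caf\u00e9</field></doc></add>";
            try (OutputStream out = conn.getOutputStream()) {
                // The bytes written must actually be UTF-8, matching the header.
                out.write(xml.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }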

--Christian

-----Original Message-----
From: Gereon Steffens [mailto:gereon@steffens.org]
Sent: Wednesday, May 2, 2007 09:59
To: solr-user@lucene.apache.org
Subject: UTF-8 2-byte vs 4-byte encodings


UTF-8 2-byte vs 4-byte encodings

Posted by Gereon Steffens <ge...@steffens.org>.
Hi,

I have a question regarding UTF-8 encodings, illustrated by the
utf8-example.xml file. This file contains raw, unescaped UTF-8 characters,
for example the "e acute" character, represented as two bytes 0xC3 0xA9.
When this file is added to Solr and retrieved later, the XML output
contains a four-byte representation of that character, namely 0xC2 0x83
0xC2 0xA9.

If, on the other hand, the input data contains this same character as an
entity &#xE9; the output contains the two-byte encoded representation 0xC3
0xA9.

Why is that so, and is there a way to always get characters like these out
of Solr as their two-byte representations?

The reason I'm asking is that I often have to deal with CDATA sections in
my input files that contain raw (two-byte) UTF-8 characters that can't be
encoded as entities.
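
For what it's worth, a four-byte sequence very much like this is what
falls out if the two UTF-8 bytes get decoded as a single-byte charset
somewhere along the way and then re-encoded as UTF-8. A minimal sketch
(ISO-8859-1 is only a guess at the charset involved):

    import java.nio.charset.StandardCharsets;

    public class DoubleEncode {
        public static void main(String[] args) {
            // "e acute" as UTF-8: 0xC3 0xA9
            byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);
            // Wrongly decode those two bytes as ISO-8859-1 ...
            String misread = new String(utf8, StandardCharsets.ISO_8859_1);
            // ... and re-encode as UTF-8: now four bytes.
            byte[] doubled = misread.getBytes(StandardCharsets.UTF_8);
            for (byte b : doubled) {
                System.out.printf("0x%02X ", b); // 0xC3 0x83 0xC2 0xA9
            }
        }
    }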

Thanks,
Gereon


RE: Unicode characters

Posted by HUYLEBROECK Jeremy RD-ILAB-SSF <je...@orange-ftgroup.com>.
Thanks a lot for the time you spent understanding my problem and
checking for a solution in Neko!
It helps a lot.


-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: Friday, April 27, 2007 4:02 PM
To: solr-user@lucene.apache.org
Subject: Re: Unicode characters


Re: Unicode characters

Posted by Chris Hostetter <ho...@fucit.org>.
: -fetch a web page
: -decode entities and unicode characters (such as &#149;) using the Neko
: library
: -get a unicode String in Java
: -send it to Solr through XML created by SAX, with the right encoding
: (UTF-8) specified everywhere (writer, header, etc.)
: -it apparently arrives clean on the Solr side (verified in our logs).
: -In the query output from Solr (XML message), the character is not
: encoded as an entity (not &#149;) but the character itself is used
: (character 149 = 0x95 hexadecimal).

Just because someone uses an HTML entity to display a character in a web
page doesn't mean it needs to be "escaped" in XML ... I think that in
theory we could use numeric entities to escape *every* character, but
that would make the XML responses a lot bigger ... so in general Solr
only escapes the characters that need to be escaped to have a valid
UTF-8 XML response.

You may also be having some additional problems since 149 (hex 95) is not
a printable UTF-8 character; it's a control character (MESSAGE WAITING)
... it sounds like you're dealing with HTML where people were using the
numeric values from the "Windows-1252" charset.

You may want to modify your parsing code to do some mapping of the
"control" characters that you know aren't meant to be control characters
before you ever send them to Solr. A quick search for "Neko
windows-1252" indicates that enough people have had problems with this
that it is a built-in feature...
    http://people.apache.org/~andyc/neko/doc/html/settings.html
    "http://cyberneko.org/html/features/scanner/fix-mswindows-refs
     Specifies whether to fix character entity references for Microsoft
     Windows characters as described at
     http://www.cs.tut.fi/~jkorpela/www/windows-chars.html."
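
A sketch of both routes, assuming the NekoHTML DOMParser class (the
feature URI is the one from the settings page linked above):

    import org.cyberneko.html.parsers.DOMParser;

    public class FixWindowsChars {
        public static void main(String[] args) throws Exception {
            // Route 1: let Neko rewrite the Windows-1252 numeric
            // references while it parses the HTML.
            DOMParser parser = new DOMParser();
            parser.setFeature(
                "http://cyberneko.org/html/features/scanner/fix-mswindows-refs",
                true);
            // parser.parse(...) as usual from here on.

            // Route 2: map by hand. Bytes 0x80-0x9F are printable in
            // windows-1252, so a round-trip through that charset
            // recovers the intended character:
            String fixed = new String(new byte[] { (byte) 0x95 }, "windows-1252");
            System.out.println(fixed); // the bullet character, U+2022
        }
    }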

(I've run into this a number of times over the years when dealing with
content created by Windows users, as you can see from my one and only
thread on "JavaJunkies" ...
  http://www.javajunkies.org/index.pl?node_id=3436
)


-Hoss


Re: Unicode characters

Posted by Yonik Seeley <yo...@apache.org>.
On 4/27/07, HUYLEBROECK Jeremy RD-ILAB-SSF wrote:
> -In the query output from SOLR (XML message), the character is not
> encoded as an entity (not &#149;) but the character itself is used
> (character 149=95 hexadecimal).

That's fine, as they are equivalent representations, and that
character is directly representable in UTF-8 (which Solr uses for its
output).
Is this causing a problem for you somehow?
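
A quick check of that equivalence with the JDK's XML parser, as a
sketch:

    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.xml.sax.InputSource;

    public class SameChar {
        public static void main(String[] args) throws Exception {
            // The numeric entity and the raw character decode to the
            // same string once the XML is actually parsed.
            System.out.println(text("<v>&#233;</v>")
                    .equals(text("<v>\u00e9</v>"))); // true
        }

        static String text(String xml) throws Exception {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        }
    }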

-Yonik