Posted to dev@httpd.apache.org by Brandon Fosdick <bf...@bfoz.net> on 2005/09/12 02:13:54 UTC

How do I handle Unicode inside XML request bits?

I'm not sure if this should be here or the modules list, but here goes anyway...

I'm working on a module for 2.0.x (mod_dav_userdir) and I'm using litmus for testing. In one of the tests, where it sets a dead property on a resource, it sends some XML with a Unicode character in it, like so:

[<?xml version="1.0" encoding="utf-8" ?><propertyupdate xmlns='DAV:'><set><prop><high-unicode xmlns='http://webdav.org/neon/litmus/'>&#65536;</high-unicode></prop></set></propertyupdate>]

mod_dav_fs runs the XML through apr_xml_quote_elem() and apr_xml_to_text() before storing it in a database. So, naturally, I tried doing the same. But when I echo the text to the log I get gibberish, and when it's stored in MySQL it shows as '????'. Reading the text straight out of the provided apr_xml_elem also returns gibberish. The litmus test fails with the message:

18. propget............... FAIL (Property {http://webdav.org/neon/litmus/}high-unicode had value ????, expected 𐀀)

Apparently the original text ("&#65536;") from the request is getting parsed into something that isn't usable. Is there some way to just read the original text and store that? What's the proper way to handle Unicode characters?

Thanks
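
For readers following along, here is a minimal sketch of what an XML parser does with the numeric character reference in that request. It uses Python's standard-library ElementTree rather than apr_xml, so it only illustrates the general behavior and is not the module's code path. The names BODY, elem, value and raw are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# The request body from the litmus test, as bytes off the wire.
BODY = (
    b'<?xml version="1.0" encoding="utf-8" ?>'
    b"<propertyupdate xmlns='DAV:'><set><prop>"
    b"<high-unicode xmlns='http://webdav.org/neon/litmus/'>&#65536;</high-unicode>"
    b"</prop></set></propertyupdate>"
)

root = ET.fromstring(BODY)
elem = next(root.iter("{http://webdav.org/neon/litmus/}high-unicode"))

# The parser resolves the reference: the element text is the single
# character U+10000, not the eight-character string "&#65536;".
value = elem.text

# Serialized as UTF-8, that one character occupies four bytes.
raw = value.encode("utf-8")

print(len(value), hex(ord(value)), raw.hex())
```

The "gibberish" in the log is therefore most likely those four bytes shown by something that does not interpret them as UTF-8, not a parsing failure.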

Re: How do I handle Unicode inside XML request bits?

Posted by André Malo <nd...@perlig.de>.
* William A. Rowe, Jr. wrote:

> Brandon Fosdick wrote:
> > Joe Orton wrote:
> >>The &#65536; character will be passed through in its four byte UTF-8
> >>form (which is 0xf4 0x80 0x80 0x80 I think)
>
> FYI - 65536 isn't a valid ucs-2 character; it is, however, a valid ucs-4
> character.
>
> That might be part of the origin of your issues, try 65535 as a MAX_VAL
> for ucs-2 (which would be a three-byte utf-8 value.)
>
> 65536 cannot be mapped to utf-8, but it can be mapped as a four byte
> utf-16 sequence.

Sure, it can. The utf-8 sequence is "\xf0\x90\x80\x80".

nd
-- 
Flhacs wird im Usenet grundsätzlich alsfhc geschrieben. Schreibt man
lafhsc nicht slfach, so ist das schlichtweg hclafs. Hingegen darf man
rihctig ruhig rhitcgi schreiben, weil eine shcalfe Schreibweise bei
irhictg nicht als shflac angesehen wird.       -- Hajo Pflüger in dnq
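
The byte sequence given in the reply above can be checked by hand. The following is a small sketch of the UTF-8 bit layout, written out explicitly instead of calling a library encoder. The function name utf8_encode is hypothetical and exists only for this illustration.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8, spelling out the bit layout."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp < 0x80:                      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(65536).hex())      # the code point from the litmus test
print(utf8_encode(65535).hex())      # last code point that fits in three bytes
print(utf8_encode(0x100000).hex())   # a different, much higher code point
```

Under this layout 65536 (U+10000) comes out as f0 90 80 80, while f4 80 80 80 is the encoding of U+100000.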

Re: How do I handle Unicode inside XML request bits?

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.
Brandon Fosdick wrote:
> Joe Orton wrote:
> 
>>The &#65536; character will be passed through in its four byte UTF-8 
>>form (which is 0xf4 0x80 0x80 0x80 I think)

FYI - 65536 isn't a valid ucs-2 character; it is, however, a valid ucs-4
character.

That might be part of the origin of your issues, try 65535 as a MAX_VAL
for ucs-2 (which would be a three-byte utf-8 value.)

65536 cannot be mapped to utf-8, but it can be mapped as a four byte
utf-16 sequence.

Bill
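
As an aside on the UTF-16 form mentioned above, here is a short sketch of how a code point beyond 65535 is split into a surrogate pair. The helper name utf16_surrogates is hypothetical. Having a UTF-16 form does not prevent the same code point from also having a UTF-8 form.

```python
def utf16_surrogates(cp):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    v = cp - 0x10000                  # 20 bits remain after the offset
    high = 0xD800 | (v >> 10)         # top 10 bits
    low = 0xDC00 | (v & 0x3FF)        # bottom 10 bits
    return high, low


hi, lo = utf16_surrogates(65536)
print(hex(hi), hex(lo))

# Two 16-bit units, so four bytes in big-endian UTF-16.
utf16_bytes = bytes([hi >> 8, hi & 0xFF, lo >> 8, lo & 0xFF])
print(utf16_bytes.hex())
```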

Re: How do I handle Unicode inside XML request bits?

Posted by Brandon Fosdick <bf...@bfoz.net>.
Joe Orton wrote:
> The &#65536; character will be passed through in its four byte UTF-8 
> form (which is 0xf4 0x80 0x80 0x80 I think); are you storing the 
> property values in the database in a UTF-8-safe field?  Problems with 
> this test are typically that some character conversion happens under 
> your feet or simply that the data store isn't 8-bit-safe.

The field I was storing it in was set to ascii_general_ci, so I tried changing it to utf8_bin and now it works. I can't look at the values in phpMyAdmin (they're displayed as blobs now), but I can live with that. Thanks for the help. 

BTW, litmus has been very helpful. Frustrating, but helpful. Although to be fair, most of my problems are due to my own ignorance.

Re: How do I handle Unicode inside XML request bits?

Posted by Joe Orton <jo...@redhat.com>.
On Sun, Sep 11, 2005 at 05:13:54PM -0700, Brandon Fosdick wrote:
...
> The litmus test fails with the message:
>
> 18. propget............... FAIL (Property 
> {http://webdav.org/neon/litmus/}high-unicode had value ????, expected 
> 𐀀)
> 
> Apparently the original text ("&#65536;") from the request is getting 
> parsed into something that isn't usable. Is there some way to just 
> read the original text and store that? What's the proper way to handle 
> Unicode characters?

The &#65536; character will be passed through in its four byte UTF-8 
form (which is 0xf4 0x80 0x80 0x80 I think); are you storing the 
property values in the database in a UTF-8-safe field?  Problems with 
this test are typically that some character conversion happens under 
your feet or simply that the data store isn't 8-bit-safe.

Regards,

joe
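
To make the "8-bit-safe" point concrete, here is a toy simulation. It does not model MySQL or any real storage engine. seven_bit_store, eight_bit_store and survives are hypothetical helpers that only mimic a store which degrades every non-ASCII byte to '?' versus one that keeps bytes untouched.

```python
def seven_bit_store(data: bytes) -> bytes:
    """Hypothetical lossy store: every byte >= 0x80 comes back as '?'."""
    return bytes(b if b < 0x80 else ord("?") for b in data)


def eight_bit_store(data: bytes) -> bytes:
    """Hypothetical 8-bit-safe store: bytes come back unchanged."""
    return bytes(data)


def survives(store, text: str) -> bool:
    """True if the UTF-8 bytes of text round-trip through the store."""
    raw = text.encode("utf-8")
    return store(raw) == raw


prop = "\U00010000"                        # the litmus high-unicode value
lossy = seven_bit_store(prop.encode("utf-8"))

print(lossy)                               # what a propget would then return
print(survives(seven_bit_store, prop))
print(survives(eight_bit_store, prop))
```

All four bytes of the UTF-8 form have the high bit set, so a store like the lossy one above would turn each of them into '?', which would match the four question marks in the litmus failure message.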