Posted to wss4j-dev@ws.apache.org by José Ferreiro <jo...@gmail.com> on 2008/11/13 18:53:39 UTC

Encoding of non latin characters (Cyrillic)

Hello all,

I have the Russian word (taken as an example): *Основное*

that is encoded as UTF-8 by Axis as:

&#x41E;&#x441;&#x43D;&#x43E;&#x432;&#x43D;&#x43E;&#x435; (a)

I can transmit this kind of information in a well-formed XML packet using
Axis 1.4, from the server back to the client in response to a client
request, without any problem. The deserialization works perfectly.


However, if I transmit it with WSS4J applied (encryption, signature and
timestamp), the following error arises:

org.apache.xml.security.encryption.XMLEncryptionException: An invalid XML
character (Unicode: 0x1e)
was found in the element content of the document.
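
For reference, the error above concerns the set of characters XML 1.0 allows in
document content. The sketch below checks code points against the XML 1.0
"Char" production. The class and method names (XmlChars, isXml10Char,
firstInvalid) are hypothetical helpers, not part of Axis, WSS4J or XML
Security. It also prints the low byte of U+041E (the first letter of the
example word), which happens to be 0x1e. That may hint at a 16-bit character
being truncated to a byte somewhere, but this is only a guess, not a diagnosis.

```java
class XmlChars {
    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first code point that XML 1.0 forbids, or -1.
    static int firstInvalid(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // The control character from the error message is not a legal XML 1.0 character.
        System.out.println(isXml10Char(0x1E));
        // The Cyrillic letter U+041E itself is legal.
        System.out.println(isXml10Char(0x41E));
        // Low byte of U+041E, shown only to illustrate the truncation guess.
        System.out.println(Integer.toHexString(0x41E & 0xFF));
    }
}
```

If the guess is right, the Cyrillic text itself is legal XML and the problem
lies in how the characters are converted to bytes before encryption.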

Therefore, to avoid invalid characters in the packet, I decided to escape
all XML characters using
org.apache.commons.lang.StringEscapeUtils.escapeXml [1].



On the client, to recover the original word, I call unescapeXml [1],
which gives this Unicode string:

&#1054;&#1089;&#1085;&#1086;&#1074;&#1085;&#1086;&#1077; (b)

First, I conclude that I am not getting the same Unicode string as at the
beginning, i.e. (a) != (b).

I then wondered which encoding I got.
I looked at the web site http://2cyr.com/decode/?lang=en to understand more,
and it looks like I got windows-1251 (see [2]),
which can be displayed in a browser with encoding="iso8859-1".

*My question is: why didn't I get UTF-8, and how is it possible that I got
(b)?*


Thank you for reading and for any comments you might have.

José Ferreiro

Many thanks to Martin Gainty and Ognjen Blagojevic for already commenting and
helping in another thread I posted.


[1] -
http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringEscapeUtils.html
[2] - http://en.wikipedia.org/wiki/CP1251

PS: Thanks to Martin and

-- 
José Ferreiro
MSc in Communication Systems, EPFL.

Re: Encoding of non latin characters (Cyrillic)

Posted by Andreas Veithen <an...@gmail.com>.
José,

Neither (a) nor (b) is UTF-8. Both are sequences of XML numeric character
references to Unicode code points. They are strictly equivalent,
except that the first one uses hexadecimal values, while the second
one uses decimal values.

Andreas
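
The equivalence described in the reply above can be checked with a small
sketch. It assumes only the Java standard library. NumericRefs and decode are
hypothetical names introduced for illustration, and the pattern handles
numeric character references only (not named entities such as &amp;). It
decodes both strings to the same text and then shows what the actual UTF-8
encoding of that text looks like in terms of byte count.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumericRefs {
    // Matches &#x41E; (hexadecimal) or &#1054; (decimal).
    private static final Pattern REF =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));");

    // Replaces every numeric character reference with the code point it names.
    static String decode(String s) {
        Matcher m = REF.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            out.appendCodePoint(cp);
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String a = "&#x41E;&#x441;&#x43D;&#x43E;&#x432;&#x43D;&#x43E;&#x435;";
        String b = "&#1054;&#1089;&#1085;&#1086;&#1074;&#1085;&#1086;&#1077;";

        // Same code points, so the decoded strings are equal.
        System.out.println(decode(a).equals(decode(b)));

        // UTF-8 is a byte encoding of those code points: 8 letters, 2 bytes each.
        System.out.println(decode(a).getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Under these assumptions, 0x41E and 1054 name the same code point, so neither
form says anything about windows-1251 or any other byte encoding.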
