Posted to soap-dev@xml.apache.org by Mike Spreitzer <ms...@us.ibm.com> on 2001/04/25 21:16:21 UTC

Current SOAP and Xerces 1.2.2 can NOT transport a String containing arbitrary Unicode characters

I've tried the April 24 nightly build of Apache SOAP, with Xerces 1.2.2, 
and both problems reported below are still present.
***************************************************************************************
Please respond to soap-dev@xml.apache.org 
To:     soap-user@xml.apache.org
cc:     soap-dev@xml.apache.org 
Subject:        Re: Can SOAP 2.1 and Xerces 1.2.2 transport a String containing an 
arbitrary Unicode character?



I used tcpdump to capture traffic containing three interesting call
messages, containing (respectively) the Strings: "qqq\u00F6zzz",
"qqq<![CDATA[zzz", and "qqq\u2030zzz" (in Java source notation) among the
data being sent.

The tcpdump output shows: (1) where \u00F6 goes, a single 00 byte was
sent; (2) the CDATA start tag was sent without being quoted; and (3) the
\u2030 was sent correctly.

Unhappily,
Mike
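
For comparison with a capture like the one described above, here is a small sketch (not part of the original message) that prints the UTF-8 bytes one would expect on the wire for the two non-ASCII test strings. The class name ExpectedUtf8 and the helper hex are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

public class ExpectedUtf8 {
    // Hypothetical helper: space-separated lowercase hex of the UTF-8 encoding of s.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so bytes >= 0x80 are not printed sign-extended.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // o-umlaut (U+00F6) encodes to the two bytes c3 b6.
        System.out.println(hex("qqq\u00F6zzz"));
        // Per-mille sign (U+2030) encodes to the three bytes e2 80 b0.
        System.out.println(hex("qqq\u2030zzz"));
    }
}
```

A single 00 byte where c3 b6 is expected means the bytes on the wire are not the UTF-8 encoding of that string.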



Re: Current SOAP and Xerces 1.2.2 can NOT transport a String containing arbitrary Unicode characters

Posted by Wouter Cloetens <wo...@mind.be>.
Mike,

I don't get it. The Unicode to UTF-8 encoding and back works peachy here. I'm
attaching a sample that sends some text up to a service that simply loops the
info back. The text contains the \u2030 and \u00F6 characters of yours:

Java source code:

// Dutch
"De zo\u00f6logische onderzee\u00ebr zinkt zo\u00ebven. " +
// French
"L'\u00e9l\u00e8ve va \u00e0 l'\u00e9cole. " +
// Spanish
"\u00bfQue? \u00a1\u00d1ina, hasta ma\u00f1ana! " +
// Symbols
"Total cost of \u00b13\u2030 of the GNP. " +
// Euro currency symbol
"Total price: \u20a0 42.00" +
// XML-unsafe text
"<moo volume=\"LOUD\">cows 'n' cats & dogs</moo>";

Hexdump of this string, converted to 16-bit Unicode by using
String.getBytes("Unicode"):

fe ff 00 44 00 65 00 20  00 7a 00 6f 00 f6 00 6c  ...D.e. .z.o...l
00 6f 00 67 00 69 00 73  00 63 00 68 00 65 00 20  .o.g.i.s.c.h.e. 
00 6f 00 6e 00 64 00 65  00 72 00 7a 00 65 00 65  .o.n.d.e.r.z.e.e
00 eb 00 72 00 20 00 7a  00 69 00 6e 00 6b 00 74  ...r. .z.i.n.k.t
00 20 00 7a 00 6f 00 eb  00 76 00 65 00 6e 00 2e  . .z.o...v.e.n..
00 20 00 4c 00 27 00 e9  00 6c 00 e8 00 76 00 65  . .L.'...l...v.e
00 20 00 76 00 61 00 20  00 e0 00 20 00 6c 00 27  . .v.a. ... .l.'
00 e9 00 63 00 6f 00 6c  00 65 00 2e 00 20 00 bf  ...c.o.l.e... ..
00 51 00 75 00 65 00 3f  00 20 00 a1 00 d1 00 69  .Q.u.e.?. .....i
00 6e 00 61 00 2c 00 20  00 68 00 61 00 73 00 74  .n.a.,. .h.a.s.t
00 61 00 20 00 6d 00 61  00 f1 00 61 00 6e 00 61  .a. .m.a...a.n.a
00 21 00 20 00 54 00 6f  00 74 00 61 00 6c 00 20  .!. .T.o.t.a.l. 
00 63 00 6f 00 73 00 74  00 20 00 6f 00 66 00 20  .c.o.s.t. .o.f. 
00 b1 00 33 20 30 00 20  00 6f 00 66 00 20 00 74  ...3 0. .o.f. .t
00 68 00 65 00 20 00 47  00 4e 00 50 00 2e 00 20  .h.e. .G.N.P... 
00 54 00 6f 00 74 00 61  00 6c 00 20 00 70 00 72  .T.o.t.a.l. .p.r
00 69 00 63 00 65 00 3a  00 20 20 a0 00 20 00 34  .i.c.e.:.  .. .4
00 32 00 2e 00 30 00 30  00 3c 00 6d 00 6f 00 6f  .2...0.0.<.m.o.o
00 20 00 76 00 6f 00 6c  00 75 00 6d 00 65 00 3d  . .v.o.l.u.m.e.=
00 22 00 4c 00 4f 00 55  00 44 00 22 00 3e 00 63  .".L.O.U.D.".>.c
00 6f 00 77 00 73 00 20  00 27 00 6e 00 27 00 20  .o.w.s. .'.n.'. 
00 63 00 61 00 74 00 73  00 20 00 26 00 20 00 64  .c.a.t.s. .&. .d
00 6f 00 67 00 73 00 3c  00 2f 00 6d 00 6f 00 6f  .o.g.s.<./.m.o.o
00 3e                                             .>
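
A dump like the one above can be reproduced with a few lines of Java. This is only a sketch: it substitutes StandardCharsets.UTF_16 (Java 7 or later) for the "Unicode" charset name used in the message, on the assumption that both give big-endian 16-bit units preceded by an fe ff byte-order mark, as the dump shows. The class name Utf16Dump and the helper hexUtf16 are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Dump {
    // Hypothetical helper: space-separated hex of the UTF-16 encoding of s.
    // Java's UTF-16 encoder writes a big-endian byte-order mark (fe ff) first.
    static String hexUtf16(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_16)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First six characters of the Dutch sample sentence.
        System.out.println(hexUtf16("De zo\u00f6"));
    }
}
```

For the six-character prefix this should print the same bytes as the start of the first dump row: fe ff 00 44 00 65 00 20 00 7a 00 6f 00 f6.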

Network trace of UTF-8 encoded element:

3c 73 74 72 69 6e 67 20  78 73 69 3a 74 79 70 65   <string. xsi:type 
3d 22 78 73 64 3a 73 74  72 69 6e 67 22 3e 44 65   ="xsd:st ring">De 
20 7a 6f c3 b6 6c 6f 67  69 73 63 68 65 20 6f 6e   .zo..log ische.on 
64 65 72 7a 65 65 c3 ab  72 20 7a 69 6e 6b 74 20   derzee.. r.zinkt. 
7a 6f c3 ab 76 65 6e 2e  20 4c 26 61 70 6f 73 3b   zo..ven. .L&apos; 
c3 a9 6c c3 a8 76 65 20  76 61 20 c3 a0 20 6c 26   ..l..ve. va....l& 
61 70 6f 73 3b c3 a9 63  6f 6c 65 2e 20 c2 bf 51   apos;..c ole....Q 
75 65 3f 20 c2 a1 c3 91  69 6e 61 2c 20 68 61 73   ue?..... ina,.has 
74 61 20 6d 61 c3 b1 61  6e 61 21 20 54 6f 74 61   ta.ma..a na!.Tota 
6c 20 63 6f 73 74 20 6f  66 20 c2 b1 33 e2 80 b0   l.cost.o f...3... 
20 6f 66 20 74 68 65 20  47 4e 50 2e 20 54 6f 74   .of.the. GNP..Tot 
61 6c 20 70 72 69 63 65  3a 20 e2 82 a0 20 34 32   al.price :.....42 
2e 30 30 26 6c 74 3b 6d  6f 6f 20 76 6f 6c 75 6d   .00&lt;m oo.volum 
65 3d 26 71 75 6f 74 3b  4c 4f 55 44 26 71 75 6f   e=&quot; LOUD&quo 
74 3b 26 67 74 3b 63 6f  77 73 20 26 61 70 6f 73   t;&gt;co ws.&apos 
3b 6e 26 61 70 6f 73 3b  20 63 61 74 73 20 26 61   ;n&apos; .cats.&a 
6d 70 3b 20 64 6f 67 73  26 6c 74 3b 2f 6d 6f 6f   mp;.dogs &lt;/moo 
26 67 74 3b 3c 2f 73 74  72 69 6e 67 3e            &gt;</st ring>
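
The trace shows the five XML-special characters going out as entity references (&lt; &gt; &amp; &quot; &apos;). A minimal sketch of that kind of escaping follows; it is not the Apache SOAP code, and the class name XmlEscapeSketch and the method escapeXml are invented for illustration.

```java
public class XmlEscapeSketch {
    // Hypothetical helper: replace the five XML-special characters with
    // their predefined entity references and copy everything else through.
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The XML-unsafe tail of the sample text.
        System.out.println(escapeXml("<moo volume=\"LOUD\">cows 'n' cats & dogs</moo>"));
        // A CDATA start delimiter sent as ordinary string data.
        System.out.println(escapeXml("qqq<![CDATA[zzz"));
    }
}
```

With "<" escaped this way, a string containing <![CDATA[ goes out as &lt;![CDATA[ and a parser reads it as plain character data, not as the start of a CDATA section.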



Now the CDATA stuff is another issue, beyond my realm of competence. If I send
up the string "<![CDATA[", I get this exception:

[SOAPException: faultCode=SOAP-ENV:Client; msg=Parsing error, response was:
The character sequence "]]>" must not appear in content unless used to mark the
end of a CDATA section.; targetException=org.xml.sax.SAXParseException: The
character sequence "]]>" must not appear in content unless used to mark the end
of a CDATA section.]

bfn, Wouter


On Wed, Apr 25, 2001 at 03:16:21PM -0400, Mike Spreitzer wrote:
> I've tried the April 24 nightly build of Apache SOAP, with Xerces 1.2.2, 
> and both problems reported below are still present.
> ***************************************************************************************
> I used tcpdump to capture traffic containing three interesting call
> messages, containing (respectively) the Strings: "qqq\u00F6zzz",
> "qqq<![CDATA[zzz", and "qqq\u2030zzz" (in Java source notation) among the
> data being sent.
> 
> The tcpdump output shows: (1) where \u00F6 goes, a single 00 byte was
> sent; (2) the CDATA start tag was sent without being quoted; and (3) the
> \u2030 was sent correctly.