Posted to soap-dev@xml.apache.org by Mike Spreitzer <ms...@us.ibm.com> on 2001/04/11 07:07:43 UTC

Re: Can SOAP 2.1 and Xerces 1.2.2 transport a String containing an arbitrary Unicode character?

I used tcpdump to capture traffic containing three interesting call 
messages, containing (respectively) the Strings: "qqq\u00F6zzz", 
"qqq<![CDATA[zzz", and "qqq\u2030zzz" (in Java source notation) among the 
data being sent.

The tcpdump output shows: (1) where \u00F6 goes, a single 00 byte was 
sent; (2) the CDATA start tag was sent without being quoted; and (3) the 
\u2030 was sent correctly.

Unhappily,
Mike
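
For reference, a minimal sketch of the quoting that case (2) calls for, assuming string values should have markup characters escaped before they are placed in the envelope. The class and method names are hypothetical and are not taken from the Apache SOAP source.

```java
public class XmlEscapeSketch {
    // Hypothetical helper (not part of Apache SOAP): escapes the characters
    // that are significant inside XML character data.
    public static String escapeText(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<': out.append("&lt;");  break;
                case '>': out.append("&gt;");  break;
                case '&': out.append("&amp;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The second test string from the message above.
        System.out.println(escapeText("qqq<![CDATA[zzz"));
    }
}
```

With escaping like this, the string "qqq<![CDATA[zzz" would go on the wire as "qqq&lt;![CDATA[zzz" and could not be mistaken for the start of a CDATA section.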

Re: Can SOAP 2.1 and Xerces 1.2.2 transport a String containing an arbitrary Unicode character?

Posted by Scott Nichol <sn...@computer.org>.
Sorry, but the message below contains incorrect information.  The line from
HTTPUtils.java only writes the HTTP headers.  The body of the request is
ultimately written using classes in javax.mail (for MIME support).  It appears that a
simple body is written in UTF-8.

Scott
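
To illustrate the distinction, here is a rough sketch (not the actual HTTPUtils.java or javax.mail code) of a request whose header block is encoded as ISO-8859-1 while the body is encoded separately as UTF-8. The class name, method name, and request path are made up for the example, and it assumes a Java version that provides java.nio.charset.StandardCharsets.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class HeaderBodyEncodingSketch {
    // Hypothetical helper: headers are encoded as ISO-8859-1, the body as
    // UTF-8, so the header charset has no effect on the body bytes.
    public static byte[] buildRequest(String body) {
        byte[] bodyBytes = body.getBytes(StandardCharsets.UTF_8);
        String headers =
              "POST /hypothetical/endpoint HTTP/1.0\r\n"
            + "Content-Type: text/xml; charset=utf-8\r\n"
            + "Content-Length: " + bodyBytes.length + "\r\n"
            + "\r\n";
        byte[] headerBytes = headers.getBytes(StandardCharsets.ISO_8859_1);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headerBytes, 0, headerBytes.length);
        out.write(bodyBytes, 0, bodyBytes.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] request = buildRequest("qqq\u00F6zzz");
        System.out.println(request.length + " bytes in total");
    }
}
```

Note that Content-Length is computed from the encoded body bytes, not from the String length, since the two differ once a character needs more than one byte.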

----- Original Message -----
From: "Scott Nichol" <sn...@computer.org>
To: <so...@xml.apache.org>
Sent: Wednesday, April 11, 2001 12:29 PM
Subject: Re: Can SOAP 2.1 and Xerces 1.2.2 transport a String containing an arbitrary
Unicode character?


> Mike,
>
> The line of code in HTTPUtils.java that actually writes the HTTP POST request is
>
>       bOutStream.write(
>           headerbuf.toString().getBytes(Constants.HEADERVAL_DEFAULT_CHARSET));
>
> Unfortunately, Constants.java contains the line
>
>   public static final String HEADERVAL_DEFAULT_CHARSET = "iso-8859-1";
>
> This means that, despite the Content-Type header and XML declaration
> specifying utf-8, the data is actually sent in the 8-bit Western character set
> (ISO-8859-1), which is byte-compatible with UTF-8 only for 7-bit ASCII
> characters.  It should be as simple as
> using the value "UTF8" in the HTTPUtils.java getBytes() call to truly support utf-8.
>
> What do the committers think about making such a change?
>
> Scott Nichol
>
> ----- Original Message -----
> From: "Mike Spreitzer" <ms...@us.ibm.com>
> To: <so...@xml.apache.org>
> Cc: <so...@xml.apache.org>
> Sent: Wednesday, April 11, 2001 1:07 AM
> Subject: Re: Can SOAP 2.1 and Xerces 1.2.2 transport a String containing an arbitrary
> Unicode character?
>
>
> > I used tcpdump to capture traffic containing three interesting call
> > messages, containing (respectively) the Strings: "qqq\u00F6zzz",
> > "qqq<![CDATA[zzz", and "qqq\u2030zzz" (in Java source notation) among the
> > data being sent.
> >
> > The tcpdump output shows: (1) where \u00F6 goes, a single 00 byte was
> > sent; (2) the CDATA start tag was sent without being quoted; and (3) the
> > \u2030 was sent correctly.
> >
> > Unhappily,
> > Mike
>
>


Re: Can SOAP 2.1 and Xerces 1.2.2 transport a String containing an arbitrary Unicode character?

Posted by Scott Nichol <sn...@computer.org>.
Mike,

The line of code in HTTPUtils.java that actually writes the HTTP POST request is

      bOutStream.write(
          headerbuf.toString().getBytes(Constants.HEADERVAL_DEFAULT_CHARSET));

Unfortunately, Constants.java contains the line

  public static final String HEADERVAL_DEFAULT_CHARSET = "iso-8859-1";

This means that, despite the Content-Type header and XML declaration
specifying utf-8, the data is actually sent in the 8-bit Western character set
(ISO-8859-1), which is byte-compatible with UTF-8 only for 7-bit ASCII
characters.  It should be as simple as
using the value "UTF8" in the HTTPUtils.java getBytes() call to truly support utf-8.

What do the committers think about making such a change?

Scott Nichol
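
As a quick check of what each charset would put on the wire, here is a small sketch that prints the bytes String.getBytes() produces for the two non-ASCII test characters. The class and method names are made up for the example, and it assumes a Java version that provides java.nio.charset.StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetBytesSketch {
    // Hypothetical helper: encodes s with the given charset and returns the
    // bytes as space-separated upper-case hex.
    public static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] samples = { "\u00F6", "\u2030" };
        for (String s : samples) {
            System.out.println("ISO-8859-1: " + hex(s, StandardCharsets.ISO_8859_1));
            System.out.println("UTF-8:      " + hex(s, StandardCharsets.UTF_8));
        }
    }
}
```

Under ISO-8859-1, \u00F6 is the single byte F6, and \u2030, which that charset cannot represent, is replaced by '?' (3F). Under UTF-8 they are C3 B6 and E2 80 B0. Neither encoding yields a 00 byte for \u00F6, so the capture quoted below does not look like a plain ISO-8859-1 encoding of the body.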

----- Original Message -----
From: "Mike Spreitzer" <ms...@us.ibm.com>
To: <so...@xml.apache.org>
Cc: <so...@xml.apache.org>
Sent: Wednesday, April 11, 2001 1:07 AM
Subject: Re: Can SOAP 2.1 and Xerces 1.2.2 transport a String containing an arbitrary
Unicode character?


> I used tcpdump to capture traffic containing three interesting call
> messages, containing (respectively) the Strings: "qqq\u00F6zzz",
> "qqq<![CDATA[zzz", and "qqq\u2030zzz" (in Java source notation) among the
> data being sent.
>
> The tcpdump output shows: (1) where \u00F6 goes, a single 00 byte was
> sent; (2) the CDATA start tag was sent without being quoted; and (3) the
> \u2030 was sent correctly.
>
> Unhappily,
> Mike


Re: Can SOAP 2.1 and Xerces 1.2.2 transport a String containing an arbitrary Unicode character?

Posted by Simon Fell <so...@zaks.demon.co.uk>.
What version are you using?  I believe these problems are not in the
latest code (i.e. build from CVS, or use a nightly build).

Cheers
Simon


On Wed, 11 Apr 2001 01:07:43 -0400, in soap you wrote:

>I used tcpdump to capture traffic containing three interesting call 
>messages, containing (respectively) the Strings: "qqq\u00F6zzz", 
>"qqq<![CDATA[zzz", and "qqq\u2030zzz" (in Java source notation) among the 
>data being sent.
>
>The tcpdump output shows: (1) where \u00F6 goes, a single 00 byte was 
>sent; (2) the CDATA start tag was sent without being quoted; and (3) the 
>\u2030 was sent correctly.
>
>Unhappily,
>Mike


Re: Can SOAP 2.1 and Xerces 1.2.2 transport a String containing an arbitrary Unicode character?

Posted by Scott Nichol <sn...@computer.org>.
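BTW, in UTF-8, characters outside the 7-bit ASCII range are encoded as multi-byte
sequences, so they have a different binary representation than in an 8-bit character
set: \u00F6, for example, becomes the two bytes C3 B6, and \u2030 becomes the three
bytes E2 80 B0.  I wrote the following snippet to look at the strings: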

import java.io.*;

// Writes the three test strings to three files, once per encoding, so the
// resulting bytes can be compared with a hex viewer.
public class UnicodeOutputTest {
    public static void main(String[] args) {
        try {
            // 1. Platform default encoding (whatever FileWriter picks).
            FileWriter fw = new FileWriter("UnicodeOutputTestDefault.out", false);
            fw.write("qqq\u00F6zzz");
            fw.write("qqq<![CDATA[zzz");
            fw.write("qqq\u2030zzz");
            System.out.println(fw.getEncoding());
            fw.close();

            // 2. Explicit UTF-8.
            FileOutputStream fos = new FileOutputStream("UnicodeOutputTestUTF8.out", false);
            OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF8");
            osw.write("qqq\u00F6zzz");
            osw.write("qqq<![CDATA[zzz");
            osw.write("qqq\u2030zzz");
            System.out.println(osw.getEncoding());
            osw.close();

            // 3. Java's "Unicode" encoding name (UTF-16).
            fos = new FileOutputStream("UnicodeOutputTestUnicode.out", false);
            osw = new OutputStreamWriter(fos, "Unicode");
            osw.write("qqq\u00F6zzz");
            osw.write("qqq<![CDATA[zzz");
            osw.write("qqq\u2030zzz");
            System.out.println(osw.getEncoding());
            osw.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Scott

----- Original Message -----
From: "Mike Spreitzer" <ms...@us.ibm.com>
To: <so...@xml.apache.org>
Cc: <so...@xml.apache.org>
Sent: Wednesday, April 11, 2001 1:07 AM
Subject: Re: Can SOAP 2.1 and Xerces 1.2.2 transport a String containing an arbitrary
Unicode character?


> I used tcpdump to capture traffic containing three interesting call
> messages, containing (respectively) the Strings: "qqq\u00F6zzz",
> "qqq<![CDATA[zzz", and "qqq\u2030zzz" (in Java source notation) among the
> data being sent.
>
> The tcpdump output shows: (1) where \u00F6 goes, a single 00 byte was
> sent; (2) the CDATA start tag was sent without being quoted; and (3) the
> \u2030 was sent correctly.
>
> Unhappily,
> Mike