Posted to dev@tomcat.apache.org by Michael Mealling <mi...@bailey.dscga.com> on 2001/02/05 03:04:10 UTC

serializing XML to a ServletOutputStream fails

(This might be a bug so I'm cc-ing to tomcat-dev)
Hi,
    I'm trying to serialize some XML out to a ServletOutputStream but
the resulting XML on the client side contains corrupted Unicode
characters (the DOM I'm serializing out contains Chinese, Korean,
English, etc). Here's the code in question:

        response.setContentType("text/xml; charset=UTF-8");
        ServletOutputStream out = response.getOutputStream();

        out.print("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
                   "<!DOCTYPE cnrp PUBLIC \"-//IETF//DTD CNRP 1.0//EN\"" +
                   " \"http://www.ietf.org/cnrp.dtd\">\n");
        out.flush();
        OutputFormat format = new OutputFormat(document);
        format.setOmitXMLDeclaration(true);
        format.setIndenting(true); // it makes debugging easier
        format.setEncoding("UTF-8"); // this is the default anyway
        XMLSerializer serializer = new XMLSerializer(out, format);
        serializer.serialize(document.getDocumentElement());

The XML that the client gets is fine except that the non-ASCII subset
of the UTF-8 encoded Unicode characters are garbled. I can serialize
the XML out to a FileOutputStream and it works just fine.

I'm running Tomcat 3.2.1 that's the backend for a remote
Apache 1.3.17 server using ajp13 (and thus mod_jk).

This code looks like it's the right way to do this but either
I've hit a bug or else I'm missing something (an encoding somewhere
between a Stream and a Writer?)

-MM

-- 
--------------------------------------------------------------------------------
Michael Mealling	|      Vote Libertarian!       | www.rwhois.net/michael
Sr. Research Engineer   |   www.ga.lp.org/gwinnett     | ICQ#:         14198821
Network Solutions	|          www.lp.org          |  michaelm@netsol.com

Re: serializing XML to a ServletOutputStream fails

Posted by "Craig R. McClanahan" <Cr...@eng.sun.com>.
Michael Mealling wrote:

> On Mon, Feb 05, 2001 at 08:17:57PM +0900, Takashi Okamoto wrote:
> > From: "Michael Mealling" <mi...@bailey.dscga.com>
> > To: <to...@jakarta.apache.org>
> > Cc: <mi...@netsol.com>
> > Sent: Monday, February 05, 2001 7:54 PM
> > Subject: Re: serializing XML to a ServletOutputStream fails
> >
> > > P.S. I've also posted this problem to HotDispatch so if you
> > > can help me solve the problem you could get $50... ;-)
> >
> > >        response.setContentType("text/xml; charset=UTF-8");
> >
> > Could you try following code instead of this?
> >
> >        response.setContentType("text/xml; charset=8859_1");
>
> Sure. Same thing. It appears that I get the same output
> regardless of what I set the content type to...
>

That is because you are using an output stream, which is just a stream of
uninterpreted bytes from the viewpoint of the servlet container.  Setting the
content type with a character encoding, as described above, will affect the Writer
that is returned by response.getWriter() -- as long as you call setContentType()
first.
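
To make the stream/Writer distinction concrete, here is a minimal sketch that uses only java.io. A ByteArrayOutputStream stands in for the servlet's output stream, and the helper charsetFromContentType is hypothetical (it is not part of the servlet API); it only mimics the idea that the charset named in the content type selects the Writer's encoding, while bytes handed to a stream pass through untouched.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamVersusWriter {

    // Hypothetical helper: pull the charset parameter out of a content type,
    // falling back to ISO-8859-1 when none is given.
    static Charset charsetFromContentType(String contentType) {
        String key = "charset=";
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, key, 0, key.length())) {
                return Charset.forName(p.substring(key.length()).trim());
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    // Characters go through a Writer, which encodes them with the chosen charset.
    static byte[] viaWriter(String text, String contentType) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, charsetFromContentType(contentType))) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Bytes go through a stream unchanged; the content type plays no role here.
    static byte[] viaStream(byte[] alreadyEncoded) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        sink.write(alreadyEncoded, 0, alreadyEncoded.length);
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u6123";
        System.out.println(viaWriter(text, "text/xml; charset=UTF-8").length);
        System.out.println(viaWriter(text, "text/xml; charset=ISO-8859-1").length);
        System.out.println(viaStream(text.getBytes(StandardCharsets.UTF_8)).length);
    }
}
```

The same character yields a different number of bytes through the Writer depending on the charset, whereas the stream path returns exactly the bytes it was given.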

>
> -MM
>

Craig McClanahan



Re: serializing XML to a ServletOutputStream fails

Posted by Michael Mealling <mi...@bailey.dscga.com>.
On Mon, Feb 05, 2001 at 08:17:57PM +0900, Takashi Okamoto wrote:
> From: "Michael Mealling" <mi...@bailey.dscga.com>
> To: <to...@jakarta.apache.org>
> Cc: <mi...@netsol.com>
> Sent: Monday, February 05, 2001 7:54 PM
> Subject: Re: serializing XML to a ServletOutputStream fails
> 
> > P.S. I've also posted this problem to HotDispatch so if you
> > can help me solve the problem you could get $50... ;-)
> 
> >        response.setContentType("text/xml; charset=UTF-8");
> 
> Could you try following code instead of this?
> 
>        response.setContentType("text/xml; charset=8859_1");

Sure. Same thing. It appears that I get the same output
regardless of what I set the content type to...

-MM


Re: serializing XML to a ServletOutputStream fails

Posted by Takashi Okamoto <to...@rd.nttdata.co.jp>.
From: "Michael Mealling" <mi...@bailey.dscga.com>
To: <to...@jakarta.apache.org>
Cc: <mi...@netsol.com>
Sent: Monday, February 05, 2001 7:54 PM
Subject: Re: serializing XML to a ServletOutputStream fails

> P.S. I've also posted this problem to HotDispatch so if you
> can help me solve the problem you could get $50... ;-)

>        response.setContentType("text/xml; charset=UTF-8");

Could you try following code instead of this?

       response.setContentType("text/xml; charset=8859_1");
-------------------------------------
takashi



RE: serializing XML to a ServletOutputStream fails

Posted by Zhu Ming <mi...@bequbed.com>.
I found the following implementation source code in
the tomcat 3.2 codeline, in src\org\apache\tomcat\core\ResponseImpl.java
or
http://jakarta.apache.org/cvsweb/index.cgi/jakarta-tomcat/src/share/org/apache/tomcat/core/Attic/ResponseImpl.java?rev=1.33.2.5&content-type=text/vnd.viewcvs-markup

    /** Write a chunk of bytes. Should be called only from
     *  ServletOutputStream implementations,
     *  No need to implement it if your adapter implements
     *  ServletOutputStream.
     *  Headers and status will be written before this method is executed.
     */
    public void doWrite(byte buffer[], int pos, int count) throws IOException {
        // XXX fix if charset is other than default.
        if (body == null)
            body = new StringBuffer();
        body.append(new String(buffer, pos, count,
                               Constants.DEFAULT_CHAR_ENCODING));
    }

So it looks like tomcat 3.2 has only implemented
ServletOutputStream for the default character set, but not
for the others.
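
The effect of decoding raw bytes into text can be sketched with plain JDK classes. This assumes the default encoding is ISO-8859-1 (the value of Constants.DEFAULT_CHAR_ENCODING is not shown in this thread), and it does not claim to reproduce what Tomcat does when the buffer is later written out. It only shows that the outcome depends on which charset turns the buffered characters back into bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DecodeRoundTrip {

    // Mimics appending incoming bytes to a character buffer using a fixed charset.
    // ISO-8859-1 is an assumption standing in for the container default.
    static String bufferAsText(byte[] chunk) {
        return new String(chunk, 0, chunk.length, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] original = "\u6123".getBytes(StandardCharsets.UTF_8);
        String buffered = bufferAsText(original);

        // Re-encoding with the same single-byte charset restores the bytes.
        byte[] same = buffered.getBytes(StandardCharsets.ISO_8859_1);
        // Re-encoding with UTF-8 treats each former byte as its own character.
        byte[] doubled = buffered.getBytes(StandardCharsets.UTF_8);

        System.out.println("round trip intact: " + Arrays.equals(original, same));
        System.out.println("bytes after UTF-8 re-encoding: " + doubled.length);
    }
}
```

If the second path is what happens somewhere between the servlet and the client, the ASCII parts of the document survive and only the multi-byte characters are damaged, which would match the symptom reported.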

Ming
----------------
P.S.: Thanks for the Unicode lesson. :) ... It seems I have no chance
to get the $50 with my poor Unicode knowledge.





Re: serializing XML to a ServletOutputStream fails

Posted by Michael Mealling <mi...@bailey.dscga.com>.
On Mon, Feb 05, 2001 at 11:24:55AM +0100, Zhu Ming wrote:
> Maybe you should not use character set "UTF-8". I remember
> that it's 8-bit Unicode. As I know, Chinese and Korean has
> 16-bit code. So at least, you should try 16-bit Unicode.
> I forgot the name, maybe it's "UTF-16". But I'm not sure if
> JDK have fully support to "UTF-16".

UTF-8 is an encoding that represents Unicode code points (16 bits
and wider) as sequences of 8-bit bytes; it is not limited to 8-bit
characters. If a byte has its high-order bit set, then it is part of
a multi-byte sequence that encodes a single code point. So UTF-8
also handles the entire Unicode set. XML itself defaults to UTF-8,
so it's something that _should_ work 'out of the box'...
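
As a small illustration of that byte layout, the sketch below classifies a UTF-8 byte by its high-order bits and reports how long a sequence it announces. The class and method names are made up for this example, and it only covers the one- to four-byte forms used for Unicode code points today.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LeadByte {

    // Returns the sequence length announced by a lead byte, or 0 for a
    // continuation byte (10xxxxxx) or any pattern outside the forms below.
    // It does not reject overlong lead bytes such as 0xC0.
    static int sequenceLength(byte b) {
        int v = b & 0xFF;
        if (v < 0x80) return 1;            // 0xxxxxxx: plain ASCII
        if ((v & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((v & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((v & 0xF8) == 0xF0) return 4;  // 11110xxx
        return 0;
    }

    public static void main(String[] args) {
        // ASCII letter, accented Latin letter, Hangul syllable, supplementary character
        for (String s : new String[] {"A", "\u00E9", "\uD55C", "\uD83D\uDE00"}) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(bytes.length + " bytes, lead byte announces "
                    + sequenceLength(bytes[0]));
        }
    }
}
```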

-MM

P.S. I've also posted this problem to HotDispatch so if you
can help me solve the problem you could get $50... ;-)


Re: serializing XML to a ServletOutputStream fails

Posted by Dimitris Dinodimos <di...@yahoo.com>.
Use the PrintWriter object returned by
response.getWriter().
You will find more info at
http://java.sun.com/products/servlet/2.2/javadoc/javax/servlet/ServletResponse.html

--- Michael Mealling <mi...@bailey.dscga.com> wrote:
> (This might be a bug so I'm cc-ing to tomcat-dev)
> Hi,
>     I'm trying to serialize some XML out to a ServletOutputStream but
> the resulting XML on the client side contains corrupted Unicode
> characters (the DOM I'm serializing out contains Chinese, Korean,
> English, etc). Here's the code in question:
>
>         response.setContentType("text/xml; charset=UTF-8");
>         ServletOutputStream out = response.getOutputStream();
>
> This code looks like it's the right way to do this but either
> I've hit a bug or else I'm missing something (an encoding somewhere
> between a Stream and a Writer?)



Re: serializing XML to a ServletOutputStream fails

Posted by cm...@yahoo.com.
Hi Michael,

I'll be working on a number of "encoding"-related problems, and support
for different charsets is one of them. The problem is not easy, and it'll
take a few weeks - but you should have most of the issues resolved before
tc3.3 beta.

You can help in a few ways:

1. Open a bug, with Encoding category.

2. Write a simple servlet that outputs the Unicode you want, test it with
tomcat standalone. Check if it reproduces the problem and attach it to the
bug.
( all test cases will be added to the sanity test, and that's a release
criterion - so it'll have to be fixed :-)

3. Try outputting the Unicode without the xml serializer. It may be an xml
problem ( AFAIK OutputStream doesn't care about the content, and it works
fine for images and any binary content ). I have a strong feeling that in
this case the problem is not in tomcat :-)
( well, if it is in xml I'll also have to deal with it :-)
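
Suggestion 3 can be sketched without any servlet or XML classes. The helper below (a hypothetical name) writes a fixed test string as UTF-8 bytes to whatever OutputStream it is given. Inside a servlet one would presumably pass response.getOutputStream(); here a ByteArrayOutputStream stands in so the bytes can be inspected.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class RawUnicodeOutput {

    // One CJK ideograph and one Hangul syllable wrapped in ASCII markup.
    static final String SAMPLE = "<t>\u6123 \uD55C</t>\n";

    // Encode explicitly and write bytes; no XML serializer is involved.
    static void writeSample(OutputStream out) {
        try {
            out.write(SAMPLE.getBytes(StandardCharsets.UTF_8));
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeSample(sink);
        System.out.println(sink.size() + " bytes written");
    }
}
```

If the client receives these bytes intact, the container's stream path is probably fine and the serializer becomes the next suspect; if not, the servlet makes a compact attachment for the bug report.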


Costin


Re: serializing XML to a ServletOutputStream fails

Posted by Tim Tye <tt...@ticnet.com>.
UTF-16 is an awkward choice for this: it takes at least two bytes per
character, is byte order sensitive, and tools that expect ASCII-compatible
bytes would not recognize the XML tags...
UTF-8 is the correct encoding!  Any 31 bit character in the ISO 10646
specification can be correctly represented in UTF-8.  The Unicode Basic
Multilingual Plane is the first 65536 characters of ISO 10646.
A CJK character code point value of 0x6123 is represented in UTF-8 as three
bytes E6 84 A3.
What byte values are you seeing for the encoding of a given Chinese code
point?
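
To answer that kind of question, a hex dump of the bytes on each side is usually enough. A minimal sketch, with hypothetical helper names, that prints the UTF-8 bytes of a code point so they can be compared with what the client actually receives:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Hex {

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // UTF-8 bytes of a single code point, as hex.
    static String utf8Hex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex(0x6123)); // prints "E6 84 A3"
    }
}
```

Running the same hex() helper over the bytes captured on the client side would show whether the three-byte sequence arrives unchanged, is replaced by a single byte, or has grown longer.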

----- Original Message -----
From: Zhu Ming <mi...@bequbed.com>
To: <to...@jakarta.apache.org>; <mi...@netsol.com>
Sent: Monday, February 05, 2001 4:24 AM
Subject: RE: serializing XML to a ServletOutputStream fails


> Hi,
>
> Maybe you should not use character set "UTF-8". I remember
> that it's 8-bit Unicode. As I know, Chinese and Korean has
> 16-bit code. So at least, you should try 16-bit Unicode.
> I forgot the name, maybe it's "UTF-16". But I'm not sure if
> JDK have fully support to "UTF-16".
>
> I'm not an Unicode expert. I'll be happy if what I say can
> be a hint to solve this problem.
>
> Ming


RE: serializing XML to a ServletOutputStream fails

Posted by cm...@yahoo.com.
> 
> Maybe you should not use character set "UTF-8". I remember
> that it's 8-bit Unicode. As I know, Chinese and Korean has
> 16-bit code. So at least, you should try 16-bit Unicode.
> I forgot the name, maybe it's "UTF-16". But I'm not sure if
> JDK have fully support to "UTF-16".

UTF8 is ok - it means ASCII characters are encoded with one byte, 
non-ASCII with more bytes. AFAIK JDK supports both UTF8 and UTF16, I'm not
sure about browsers.

In any case, the problem shouldn't be that.

Please check a simpler case, without the xml serializer.

-- 
Costin


RE: serializing XML to a ServletOutputStream fails

Posted by Zhu Ming <mi...@bequbed.com>.
Hi,

Maybe you should not use character set "UTF-8". I remember
that it's 8-bit Unicode. As I know, Chinese and Korean has
16-bit code. So at least, you should try 16-bit Unicode.
I forgot the name, maybe it's "UTF-16". But I'm not sure if
JDK have fully support to "UTF-16".

I'm not an Unicode expert. I'll be happy if what I say can
be a hint to solve this problem.

Ming

