Posted to dev@tomcat.apache.org by Michael Mealling <mi...@bailey.dscga.com> on 2001/02/05 03:04:10 UTC
serializing XML to a ServletOutputStream fails
(This might be a bug so I'm cc-ing to tomcat-dev)
Hi,
I'm trying to serialize some XML out to a ServletOutputStream but
the resulting XML on the client side contains corrupted Unicode
characters (the DOM I'm serializing out contains Chinese, Korean,
English, etc). Here's the code in question:
    response.setContentType("text/xml; charset=UTF-8");
    ServletOutputStream out = response.getOutputStream();
    out.print("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
              "<!DOCTYPE cnrp PUBLIC \"-//IETF//DTD CNRP 1.0//EN\"" +
              " \"http://www.ietf.org/cnrp.dtd\">\n");
    out.flush();
    OutputFormat format = new OutputFormat(document);
    format.setOmitXMLDeclaration(true);
    format.setIndenting(true); // it makes debugging easier
    format.setEncoding("UTF-8"); // this is the default anyway
    XMLSerializer serializer = new XMLSerializer(out, format);
    serializer.serialize(document.getDocumentElement());
The XML that the client gets is fine except that the non-ASCII subset
of the UTF-8 encoded Unicode characters are garbled. I can serialize
the XML out to a FileOutputStream and it works just fine.
I'm running Tomcat 3.2.1 that's the backend for a remote
Apache 1.3.17 server using ajp13 (and thus mod_jk).
This code looks like it's the right way to do this but either
I've hit a bug or else I'm missing something (an encoding somewhere
between a Stream and a Writer?)
-MM
--
--------------------------------------------------------------------------------
Michael Mealling | Vote Libertarian! | www.rwhois.net/michael
Sr. Research Engineer | www.ga.lp.org/gwinnett | ICQ#: 14198821
Network Solutions | www.lp.org | michaelm@netsol.com
Re: serializing XML to a ServletOutputStream fails
Posted by "Craig R. McClanahan" <Cr...@eng.sun.com>.
Michael Mealling wrote:
> On Mon, Feb 05, 2001 at 08:17:57PM +0900, Takashi Okamoto wrote:
> > From: "Michael Mealling" <mi...@bailey.dscga.com>
> > To: <to...@jakarta.apache.org>
> > Cc: <mi...@netsol.com>
> > Sent: Monday, February 05, 2001 7:54 PM
> > Subject: Re: serializing XML to a ServletOutputStream fails
> >
> > > P.S. I've also posted this problem to HotDispatch so if you
> > > can help me solve the problem you could get $50... ;-)
> >
> > > response.setContentType("text/xml; charset=UTF-8");
> >
> > Could you try the following code instead?
> >
> > response.setContentType("text/xml; charset=8859_1");
>
> Sure. Same thing. It appears that I get the same output
> regardless of what I set the content type to...
>
That is because you are using an output stream, which is just a stream of
uninterpreted bytes from the viewpoint of the servlet container. Setting the
content type with a character encoding, as described above, will affect the Writer
that is returned by response.getWriter() -- as long as you call setContentType()
first.
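For illustration, here is a minimal sketch of the stream/Writer distinction described above, using only JDK classes. It does not touch the servlet API: a ByteArrayOutputStream stands in for the response, and the class and method names (StreamVersusWriter, viaWriter, viaStream) are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StreamVersusWriter {

    /** Characters go through a Writer, which encodes them with the given charset. */
    static byte[] viaWriter(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for the response
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Bytes go straight through a stream; no charset is consulted at all. */
    static byte[] viaStream(byte[] alreadyEncoded) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        sink.write(alreadyEncoded, 0, alreadyEncoded.length);
        return sink.toByteArray();
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u6123"; // a single CJK character
        // The charset chosen for the Writer decides which bytes are produced.
        System.out.println(hex(viaWriter(text, StandardCharsets.UTF_8)));
        System.out.println(hex(viaWriter(text, StandardCharsets.ISO_8859_1)));
        // The stream passes on whatever bytes the caller already encoded.
        System.out.println(hex(viaStream(text.getBytes(StandardCharsets.UTF_8))));
    }
}
```

With a Writer the charset matters (ISO-8859-1 cannot represent the character, so a '?' is substituted); with a stream the caller alone is responsible for the encoding.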
Craig McClanahan
Re: serializing XML to a ServletOutputStream fails
Posted by Michael Mealling <mi...@bailey.dscga.com>.
On Mon, Feb 05, 2001 at 08:17:57PM +0900, Takashi Okamoto wrote:
> From: "Michael Mealling" <mi...@bailey.dscga.com>
> To: <to...@jakarta.apache.org>
> Cc: <mi...@netsol.com>
> Sent: Monday, February 05, 2001 7:54 PM
> Subject: Re: serializing XML to a ServletOutputStream fails
>
> > P.S. I've also posted this problem to HotDispatch so if you
> > can help me solve the problem you could get $50... ;-)
>
> > response.setContentType("text/xml; charset=UTF-8");
>
> Could you try the following code instead?
>
> response.setContentType("text/xml; charset=8859_1");
Sure. Same thing. It appears that I get the same output
regardless of what I set the content type to...
-MM
Re: serializing XML to a ServletOutputStream fails
Posted by Takashi Okamoto <to...@rd.nttdata.co.jp>.
From: "Michael Mealling" <mi...@bailey.dscga.com>
To: <to...@jakarta.apache.org>
Cc: <mi...@netsol.com>
Sent: Monday, February 05, 2001 7:54 PM
Subject: Re: serializing XML to a ServletOutputStream fails
> P.S. I've also posted this problem to HotDispatch so if you
> can help me solve the problem you could get $50... ;-)
> response.setContentType("text/xml; charset=UTF-8");
Could you try the following code instead?
response.setContentType("text/xml; charset=8859_1");
-------------------------------------
takashi
RE: serializing XML to a ServletOutputStream fails
Posted by Zhu Ming <mi...@bequbed.com>.
I found the following implementation source code from
the tomcat 3.2 codeline, in src\org\apache\tomcat\core\ResponseImpl.java
or
http://jakarta.apache.org/cvsweb/index.cgi/jakarta-tomcat/src/share/org/apache/tomcat/core/Attic/ResponseImpl.java?rev=1.33.2.5&content-type=text/vnd.viewcvs-markup
    /** Write a chunk of bytes. Should be called only from ServletOutputStream implementations,
     *  No need to implement it if your adapter implements ServletOutputStream.
     *  Headers and status will be written before this method is executed.
     */
    public void doWrite( byte buffer[], int pos, int count) throws IOException {
        // XXX fix if charset is other than default.
        if( body==null)
            body=new StringBuffer();
        body.append(new String(buffer, pos, count,
                               Constants.DEFAULT_CHAR_ENCODING) );
    }
So it looks like Tomcat 3.2 has only implemented
ServletOutputStream for the default character set, but not
for the others.
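A small sketch of what the decoding step in doWrite() above could do to UTF-8 bytes. It assumes that Constants.DEFAULT_CHAR_ENCODING is ISO-8859-1 (the quoted source does not show its value), and it does not show how Tomcat later turns the buffered text back into bytes, so both re-encoding outcomes are printed. The names DoWriteRoundTrip and decodeLikeDoWrite are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DoWriteRoundTrip {

    // Assumption: stand-in for Constants.DEFAULT_CHAR_ENCODING, taken here to be ISO-8859-1.
    static final Charset ASSUMED_DEFAULT = StandardCharsets.ISO_8859_1;

    /** Mirrors the decoding step in doWrite(): raw bytes are turned into characters. */
    static String decodeLikeDoWrite(byte[] buffer, int pos, int count) {
        return new String(buffer, pos, count, ASSUMED_DEFAULT);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u6123".getBytes(StandardCharsets.UTF_8);
        String body = decodeLikeDoWrite(utf8, 0, utf8.length);

        System.out.println("sent by servlet:        " + hex(utf8));
        // Re-encoding with the same single-byte charset restores the original bytes.
        System.out.println("re-encoded as 8859-1:   " + hex(body.getBytes(StandardCharsets.ISO_8859_1)));
        // Re-encoding with any other charset does not; UTF-8 encodes each byte a second time.
        System.out.println("re-encoded as UTF-8:    " + hex(body.getBytes(StandardCharsets.UTF_8)));
    }
}
```

So under these assumptions the bytes survive only if the container writes the buffer back out with exactly the charset it used to decode it; any mismatch garbles the non-ASCII characters while leaving ASCII intact, which matches the reported symptom.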
Ming
----------------
P.S.: Thanks for the Unicode lesson. :) ... It seems I have no chance
to get the $50 with my poor Unicode knowledge.
---------------------------------------------------------------------
To unsubscribe, e-mail: tomcat-dev-unsubscribe@jakarta.apache.org
For additional commands, email: tomcat-dev-help@jakarta.apache.org
Re: serializing XML to a ServletOutputStream fails
Posted by Michael Mealling <mi...@bailey.dscga.com>.
On Mon, Feb 05, 2001 at 11:24:55AM +0100, Zhu Ming wrote:
> Maybe you should not use the character set "UTF-8". I remember
> that it's 8-bit Unicode. As far as I know, Chinese and Korean have
> 16-bit codes, so you should at least try 16-bit Unicode.
> I forget the name; maybe it's "UTF-16". But I'm not sure whether
> the JDK fully supports "UTF-16".
UTF-8 is an encoding that represents every Unicode code point, including
those that need 16 bits or more, as a sequence of one or more 8-bit bytes;
it is not limited to 8-bit characters. If a byte has its high-order bit set
then you know it is part of a multi-byte sequence for a single code point.
So UTF-8 also handles the entire Unicode set. XML itself defaults to UTF-8
so it's something that _should_ work 'out of the box'...
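To make the high-order-bit rule concrete, here is a short sketch that classifies UTF-8 bytes by their leading bits. It is a simplification that only looks at bit patterns and does not validate sequences; Utf8LeadByte, sequenceLength and countCodePoints are names made up for this example.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LeadByte {

    /**
     * Returns the sequence length announced by a lead byte, 0 for a
     * continuation byte, or -1 if the bit pattern is not used by UTF-8.
     */
    static int sequenceLength(byte b) {
        int v = b & 0xFF;
        if (v < 0x80) return 1;             // 0xxxxxxx: ASCII, a complete character
        if ((v & 0xC0) == 0x80) return 0;   // 10xxxxxx: continuation of an earlier lead byte
        if ((v & 0xE0) == 0xC0) return 2;   // 110xxxxx: lead byte of a 2-byte sequence
        if ((v & 0xF0) == 0xE0) return 3;   // 1110xxxx: lead byte of a 3-byte sequence
        if ((v & 0xF8) == 0xF0) return 4;   // 11110xxx: lead byte of a 4-byte sequence
        return -1;
    }

    /** Counts code points by counting every byte that is not a continuation byte. */
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if (sequenceLength(b) > 0) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // One ASCII letter, one Latin-1 letter, one CJK character.
        byte[] utf8 = "a\u00e9\u6123".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length + " bytes, " + countCodePoints(utf8) + " code points");
    }
}
```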
-MM
P.S. I've also posted this problem to HotDispatch so if you
can help me solve the problem you could get $50... ;-)
Re: serializing XML to a ServletOutputStream fails
Posted by Dimitris Dinodimos <di...@yahoo.com>.
Use the PrintWriter object returned by
response.getWriter().
You will find more info at
http://java.sun.com/products/servlet/2.2/javadoc/javax/servlet/ServletResponse.html
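As a rough illustration of the Writer-based approach, the sketch below serializes a DOM through a Writer that does the UTF-8 encoding. It is not the code from the original post: it uses the JDK's javax.xml.transform classes instead of the Xerces XMLSerializer, a ByteArrayOutputStream stands in for the servlet response, and the class name, method names and sample element are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class SerializeThroughWriter {

    /** Builds a tiny document whose root element contains one CJK character. */
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("cnrp"); // illustrative element name
            root.setTextContent("\u6123");
            doc.appendChild(root);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes the document through a Writer that encodes characters as UTF-8. */
    static byte[] serialize(Document doc) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for the response
        try (Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return sink.toByteArray();
    }

    /** Parses the serialized bytes again and returns the root element's text. */
    static String reparsedText(byte[] xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = serialize(sampleDocument());
        System.out.println("round trip intact: " + "\u6123".equals(reparsedText(xml)));
    }
}
```

In a servlet, the Writer would presumably be the one returned by response.getWriter() after setContentType() has been called, rather than an OutputStreamWriter built by hand.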
--- Michael Mealling <mi...@bailey.dscga.com> wrote:
> (This might be a bug so I'm cc-ing to tomcat-dev)
> Hi,
> I'm trying to serialize some XML out to a ServletOutputStream but
> the resulting XML on the client side contains corrupted Unicode
> characters (the DOM I'm serializing out contains Chinese, Korean,
> English, etc). Here's the code in question:
>
> response.setContentType("text/xml; charset=UTF-8");
> ServletOutputStream out = response.getOutputStream();
>
> This code looks like it's the right way to do this but either
> I've hit a bug or else I'm missing something (an encoding somewhere
> between a Stream and a Writer?)
>
Re: serializing XML to a ServletOutputStream fails
Posted by cm...@yahoo.com.
Hi Michael,
I'll be working on a number of "encoding"-related problems, and support
for different charsets is one of them. The problem is not easy, and it'll
take a few weeks - but you should have most of the issues resolved before
tc3.3 beta.
You can help in a few ways:
1. Open a bug, with Encoding category.
2. Write a simple servlet that outputs the Unicode you want, test it with
Tomcat standalone. Check if it reproduces the problem and attach it to the
bug.
(All test cases will be added to the sanity test, and that's a release
criterion - so it'll have to be fixed :-)
3. Try outputting the Unicode without the XML serializer. It may be an XML
problem (AFAIK OutputStream doesn't care about the content, and it works
fine for images and any binary text). I have a strong feeling that in
this case the problem is not in Tomcat :-)
(Well, if it is in XML I'll also have to deal with it :-)
Costin
Re: serializing XML to a ServletOutputStream fails
Posted by Tim Tye <tt...@ticnet.com>.
UTF-16 is not an acceptable encoding for XML as it takes two bytes per
character, is byte order sensitive, and the XML tags would not be
recognized...
UTF-8 is the correct encoding! Any 31-bit character in the ISO 10646
specification can be correctly represented in UTF-8. Unicode is the first
65536 characters of ISO 10646.
A CJK character code point value of 0x6123 is represented in UTF-8 as three
bytes E6 84 A3.
What byte values are you seeing for the encoding of a given Chinese code
point?
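The three-byte value quoted above can be checked by hand. The sketch below applies the UTF-8 bit layout for code points in the range U+0800 to U+FFFF and compares the result with the JDK's own encoder; the hex helper is also a convenient way to dump the bytes a client actually receives. Utf8ThreeByte and its methods are names made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8ThreeByte {

    /** Encodes a code point in U+0800..U+FFFF (surrogates excluded) as three UTF-8 bytes. */
    static byte[] encodeThreeByte(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException("not a three-byte code point: " + codePoint);
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // 1110xxxx: top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // 10xxxxxx: middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // 10xxxxxx: low 6 bits
        };
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encodeThreeByte(0x6123);
        byte[] library = new String(Character.toChars(0x6123)).getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(manual));                    // E6 84 A3
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```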
----- Original Message -----
From: Zhu Ming <mi...@bequbed.com>
To: <to...@jakarta.apache.org>; <mi...@netsol.com>
Sent: Monday, February 05, 2001 4:24 AM
Subject: RE: serializing XML to a ServletOutputStream fails
> Hi,
>
> Maybe you should not use the character set "UTF-8". I remember
> that it's 8-bit Unicode. As far as I know, Chinese and Korean have
> 16-bit codes, so you should at least try 16-bit Unicode.
> I forget the name; maybe it's "UTF-16". But I'm not sure whether
> the JDK fully supports "UTF-16".
>
> I'm not a Unicode expert. I'll be happy if what I say can
> be a hint toward solving this problem.
>
> Ming
RE: serializing XML to a ServletOutputStream fails
Posted by cm...@yahoo.com.
>
> Maybe you should not use the character set "UTF-8". I remember
> that it's 8-bit Unicode. As far as I know, Chinese and Korean have
> 16-bit codes, so you should at least try 16-bit Unicode.
> I forget the name; maybe it's "UTF-16". But I'm not sure whether
> the JDK fully supports "UTF-16".
UTF8 is ok - it means ASCII characters are encoded with one byte,
non-ASCII with more bytes. AFAIK JDK supports both UTF8 and UTF16, I'm not
sure about browsers.
In any case, the problem shouldn't be that.
Please check a simpler case, without the xml serializer.
--
Costin
RE: serializing XML to a ServletOutputStream fails
Posted by Zhu Ming <mi...@bequbed.com>.
Hi,
Maybe you should not use the character set "UTF-8". I remember
that it's 8-bit Unicode. As far as I know, Chinese and Korean have
16-bit codes, so you should at least try 16-bit Unicode.
I forget the name; maybe it's "UTF-16". But I'm not sure whether
the JDK fully supports "UTF-16".
I'm not a Unicode expert. I'll be happy if what I say can
be a hint toward solving this problem.
Ming