Posted to java-user@axis.apache.org by Michael Serero <mi...@covad.net> on 2004/01/21 01:05:28 UTC

Character encoding

Yet another character encoding issue!

When I send a SOAP request to my server and one of the Strings contains a
smart quote (“), the server generates the following parsing error:

org.xml.sax.SAXParseException: Character conversion error: "Unconvertible
UTF-8 character beginning with 0x93" (line number may be too low).

I have checked the request, and the smart quote is embedded in it as is.
Axis uses the SimpleSerializer to serialize java.lang.String.
I browsed the serializer code, and it does not seem to convert the String
using UTF-8 encoding before sending the request.

That seems like a big flaw, given that the XML of the request must use
UTF-8. I believe there is something I missed.

Do I really need to write my own serializer/deserializer for
java.lang.String to take care of the Unicode / UTF-8 conversion?

Michael



Re: Character encoding

Posted by Jens Schumann <ml...@void.fm>.
On 1/27/04 10:20 PM Michael Serero <mi...@covad.net> wrote:

> Nelson,
> 
> Thanks for your reply. Do you have any suggestion on how to convert CP1252
> to UTF-8?
> 
> I have tried something along the following lines:
> 
> String myString = "The CP1252 string";
> Charset cs = Charset.forName("UTF-8");
> context.writeSafeString(new String(cs.encode(myString).array()));
> 
> But it did not work for me.

Since the character problem occurs on the server side, what is your client?
If you use Axis on the client side you shouldn't have problems, though you
may be better off using a nightly snapshot.

Jens


RE: Character encoding

Posted by Michael Serero <mi...@covad.net>.
Nelson,

Thanks for your reply. Do you have any suggestion on how to convert CP1252
to UTF-8?

I have tried something along the following lines:

    String myString = "The CP1252 string";
    Charset cs = Charset.forName("UTF-8");
    context.writeSafeString(new String(cs.encode(myString).array()));

But it did not work for me.
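A note on why the snippet above likely has no useful effect, with a sketch of the conversion it seems to be aiming for. A Java String has no byte encoding of its own, so encoding it to UTF-8 bytes and then calling new String(byte[]) simply decodes those bytes again with the platform default charset. In addition, ByteBuffer.array() may return a backing array longer than the encoded content. Transcoding only makes sense at the point where bytes enter or leave the program. The class and method names below are hypothetical, and the sketch assumes a JDK with StandardCharsets (Java 7 or later); where such a conversion would belong in an Axis client is not established by this thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Axis.
public class Cp1252Transcoder {

    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode bytes that are known to be CP1252 into a String,
    // then encode that String as UTF-8 bytes.
    static byte[] cp1252ToUtf8(byte[] cp1252Bytes) {
        String text = new String(cp1252Bytes, CP1252);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Encode via Charset.encode() but copy only the bytes between
    // position and limit, instead of the whole backing array.
    static byte[] encodeUtf8(String text) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] utf8 = cp1252ToUtf8(new byte[] { (byte) 0x93 });
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Running main prints the three bytes E2 80 9C, which is the UTF-8 form of the left smart quote.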

Also, I am puzzled by the XMLUtils.encodeString() method.
If the string argument contains one of the characters &, ", \, ', < or >,
then all those characters, plus any characters coded on more than one byte,
get escaped.

The substitution does not take place if the "magic" characters are not in
the string (?). In other words, the non-US-ASCII characters get encoded
differently depending on whether other characters in the string need encoding.
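For contrast, here is a minimal sketch of an escaper that treats every character the same way whether or not a markup character appears elsewhere in the string. It is not the Axis XMLUtils implementation; the class and method names are hypothetical. It escapes the five XML markup characters as named entities and writes every non-ASCII code point as a numeric character reference. It does not handle control characters that XML 1.0 forbids.

```java
// Hypothetical illustration, not the Axis XMLUtils code.
public class UniformXmlEscaper {

    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Iterate by code point so surrogate pairs stay together.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:
                    if (cp > 0x7F) {
                        // Non-ASCII: always a numeric character reference.
                        out.append("&#x")
                           .append(Integer.toHexString(cp).toUpperCase())
                           .append(';');
                    } else {
                        out.append((char) cp);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("\u201Chi\u201D"));
        System.out.println(escape("a & \u201Cb\u201D"));
    }
}
```

With this approach the smart quotes come out as &#x201C; and &#x201D; in both calls, so the result does not depend on the presence of the ampersand.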

Michael


-----Original Message-----
From: Nelson Minar [mailto:nelson@monkey.org]
Sent: Tuesday, January 20, 2004 4:16 PM
To: axis-user@ws.apache.org
Subject: Re: Character encoding


>When I send a SOAP request to my server and one of the Strings contains a
>smart quote (“), the server generates the following parsing error:
>org.xml.sax.SAXParseException: Character conversion error: "Unconvertible
>UTF-8 character beginning with 0x93" (line number may be too low).

I'm not positive, but I suspect the problem is that 0x93 is not a
valid way to encode a quotation mark in UTF-8. Depending on what byte
follows 0x93 the input may not even be valid UTF-8, which I think is
what that error is telling you.

Whatever software is generating that request is probably taking
Windows CP1252 and pretending it's UTF-8. You'll need to fix that.


Re: Character encoding

Posted by Nelson Minar <ne...@monkey.org>.
>When I send a SOAP request to my server and one of the Strings contains a
>smart quote (“), the server generates the following parsing error:
>org.xml.sax.SAXParseException: Character conversion error: "Unconvertible
>UTF-8 character beginning with 0x93" (line number may be too low).

I'm not positive, but I suspect the problem is that 0x93 is not a
valid way to encode a quotation mark in UTF-8. Depending on what byte
follows 0x93 the input may not even be valid UTF-8, which I think is
what that error is telling you.

Whatever software is generating that request is probably taking
Windows CP1252 and pretending it's UTF-8. You'll need to fix that.
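
The byte-level picture behind this diagnosis can be checked with a short sketch. In CP1252 the single byte 0x93 is the left smart quote (U+201C). In UTF-8, 0x93 has the bit pattern of a continuation byte and cannot begin a character, so a strict decoder rejects it, while the correct UTF-8 form of U+201C is the three bytes E2 80 9C. The class and method names below are hypothetical, and the code assumes a JDK with StandardCharsets (Java 7 or later).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class.
public class SmartQuoteBytes {

    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0x93 };

        // Interpreted as CP1252, 0x93 is the left smart quote.
        String quote = new String(raw, Charset.forName("windows-1252"));
        System.out.println(Integer.toHexString(quote.charAt(0))); // 201c

        // Interpreted as UTF-8, the lone byte is malformed.
        System.out.println(isValidUtf8(raw)); // false

        // The correct UTF-8 encoding of the same character.
        byte[] utf8 = quote.getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(utf8));         // E2 80 9C
        System.out.println(isValidUtf8(utf8)); // true
    }
}
```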