Posted to java-user@axis.apache.org by Amandeep Singh <as...@quark.com> on 2008/06/09 22:18:32 UTC

Invalid UTF-8 character encoding in SOAP response

Hi All,

 

I am using Axis 1.3. If the response contains a CJK character encoded in
UTF-8, Axis converts it into an XML character reference (numeric entity).
On the receiving side, XML parsing fails with an error saying that the
entity is invalid.

The character in question has the UTF-8 value F0AA989A. Axis converts it
into &#xD869;&#xDE1A;&#xD858;&#xDF4C;, and the parser fails at the first
entity.
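
The failure on the receiving side can be reproduced without Axis. The sketch below assumes the receiver uses a standard JAXP parser such as the one bundled with the JDK. The class name CharRefCheck and the method isWellFormed are made up for illustration. It feeds the parser a surrogate-based pair of references and then a single reference to the code point itself.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Axis: checks whether a snippet is well-formed XML.
class CharRefCheck {
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the document (for example, an illegal character reference).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // References to the two surrogate code units, as Axis emitted them.
        System.out.println(isWellFormed("<a>&#xD869;&#xDE1A;</a>"));
        // One reference to the actual code point.
        System.out.println(isWellFormed("<a>&#x2A61A;</a>"));
    }
}
```

With a conforming parser, the first document should be rejected and the second accepted, which matches the error described above.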

 

Any ideas or hints would be greatly appreciated.

 

Thanks,

Aman


RE: Invalid UTF-8 character encoding in SOAP response

Posted by Amandeep Singh <as...@quark.com>.
Posting the solution.

The issue is in the UTF8Encoder class of Axis. The class does not take
surrogate characters into account, so each half of a surrogate pair is
written as its own entity. The solution is to override that class so that
it handles surrogates.
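
For anyone hitting the same problem, here is a minimal sketch of the surrogate-aware escaping such an override needs. It is a standalone illustration, not the actual Axis code: the class name SurrogateAwareEscaper and the method escape are hypothetical, and wiring it into Axis in place of UTF8Encoder is left out. It assumes Java 5 or later for String.codePointAt.

```java
// Hypothetical standalone escaper, not the Axis UTF8Encoder itself.
class SurrogateAwareEscaper {
    /** Escapes XML text, writing each non-ASCII code point as one character reference. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // codePointAt joins a valid high/low surrogate pair into one code point.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (cp < 0x80) {
                        // ASCII passes through (illegal control characters are not handled here).
                        out.appendCodePoint(cp);
                    } else if (cp >= 0xD800 && cp <= 0xDFFF) {
                        // An unpaired surrogate has no legal XML representation.
                        throw new IllegalArgumentException(
                                "Unpaired surrogate at index " + (i - 1));
                    } else {
                        out.append("&#x")
                           .append(Integer.toHexString(cp).toUpperCase())
                           .append(';');
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The character from this thread, written as its two UTF-16 code units.
        System.out.println(escape("\uD869\uDE1A")); // prints &#x2A61A;
    }
}
```

The key point is iterating by code point rather than by char, so a surrogate pair produces a single entity.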

Is this fixed in the latest version of Axis? Just curious.

Thanks,
Aman

-----Original Message-----
From: Amandeep Singh [mailto:asingh@quark.com] 
Sent: Monday, June 09, 2008 3:09 PM
To: axis-user@ws.apache.org
Subject: RE: Invalid UTF-8 character encoding in SOAP response

Thanks Andreas. 

My bad. The entity being produced is &#xD869;&#xDE1A;

So, anyone who has axis 1 experience, any suggestions as to how to force
axis to output correct entity?


Thanks,
Aman

-----Original Message-----
From: Andreas Veithen [mailto:andreas.veithen@skynet.be] 
Sent: Monday, June 09, 2008 2:59 PM
To: axis-user@ws.apache.org
Subject: Re: Invalid UTF-8 character encoding in SOAP response

Aman,

D869 DE1A is actually the surrogate pair for the character with code  
point 2A61A, which is encoded as F0AA989A in UTF-8 (see
http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi).
The two other character references (&#xD858;&#xDF4C;) correspond to  
another character. I'm not an expert, but the XML specs don't mention  
surrogate pairs and I think that the correct way of encoding the  
character as a character reference should be &#x2A61A; in this case.  
This definitely looks like a bug in the XML parser. I would try to  
replace the XML parser by a new version of the same parser or by  
another parser. I'm not familiar with Axis 1, so I don't know what  
kind of parser (SAX or StAX) it uses. Maybe somebody else on the list  
can give a hint?

Andreas


On 9 June 08, at 22:18, Amandeep Singh wrote:

> Hi All,
>
> I am using axis 1.3. If the response contains a CJK character in  
> UTF-8, axis converts it into an xml entity. On the receiver side,  
> xml parsing fails saying that it is an invalid xml entity.
>
> The character used has UTF-8 value F0AA989A. And axis converts it  
> into &#xD869;&#xDE1A;&#xD858;&#xDF4C;. And parser fails at first  
> entity.
>
> Any ideas/hints would be greatly appreciated?
>
> Thanks,
> Aman


---------------------------------------------------------------------
To unsubscribe, e-mail: axis-user-unsubscribe@ws.apache.org
For additional commands, e-mail: axis-user-help@ws.apache.org


RE: Invalid UTF-8 character encoding in SOAP response

Posted by Amandeep Singh <as...@quark.com>.
Thanks, Andreas.

My mistake: the entity actually being produced is &#xD869;&#xDE1A;.

So, does anyone with Axis 1 experience have a suggestion for how to force
Axis to output the correct entity?


Thanks,
Aman



Re: Invalid UTF-8 character encoding in SOAP response

Posted by Andreas Veithen <an...@skynet.be>.
Aman,

D869 DE1A is actually the UTF-16 surrogate pair for the character with
code point 2A61A, which is encoded as F0AA989A in UTF-8 (see
http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi). The two other character
references (&#xD858;&#xDF4C;) correspond to a different character. I'm not
an expert, but the XML spec excludes the surrogate blocks from its range
of legal characters, and I think the correct way of writing this character
as a character reference is &#x2A61A;. This definitely looks like a bug
in the XML parser. I would try replacing the XML parser with a newer
version of the same parser or with another parser. I'm not familiar with
Axis 1, so I don't know what kind of parser (SAX or StAX) it uses. Maybe
somebody else on the list can give a hint?
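
The relationship between the UTF-8 bytes, the code point and the surrogate pair can be checked with a short Java sketch. The class and method names (SurrogatePairDemo, fromUtf8, combine) are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: relates UTF-8 bytes, code points and UTF-16 surrogate pairs.
class SurrogatePairDemo {
    /** Decodes a UTF-8 byte sequence and returns its first code point. */
    static int fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    /** Combines a high and a low surrogate using the UTF-16 formula. */
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xF0, (byte) 0xAA, (byte) 0x98, (byte) 0x9A};
        int cp = fromUtf8(utf8);
        System.out.printf("U+%X%n", cp);                   // U+2A61A

        // Split the code point back into its two UTF-16 code units.
        char[] units = Character.toChars(cp);
        System.out.printf("%04X %04X%n",
                (int) units[0], (int) units[1]);           // D869 DE1A

        // The other pair from the original report is a different code point.
        System.out.printf("U+%X%n",
                combine('\uD858', '\uDF4C'));              // U+2634C
    }
}
```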

Andreas

