Posted to users@cxf.apache.org by "Wolf, Chris (IT)" <Ch...@morganstanley.com> on 2008/08/19 22:47:36 UTC
Dealing with non-UTF-8 messages
We now have some messages that contain ISO-8859-1 (non-ASCII)
characters which, apparently, are not valid UTF-8, e.g. accented
characters used in Spanish and/or French. These are causing exceptions
citing invalid UTF-8 character sequences.
I was able to eliminate one source by explicitly setting the encoding to
"iso-8859-1" in the parser of one of our interceptors, but now the error
has simply migrated to one of the CXF built-in interceptors. Is there a
way to globally set the encoding?
I searched the FAQs and list archives and found nothing helpful.
Thanks,
-Chris W.
--------------------------------------------------------
NOTICE: If received in error, please destroy and notify sender. Sender does not intend to waive confidentiality or privilege. Use of this email is prohibited when received in error.
RE: Dealing with non-UTF-8 messages
Posted by "Wolf, Chris (IT)" <Ch...@morganstanley.com>.
By "prolog", I assume you mean the XML processing instruction ("xml pi").
http://www.w3.org/TR/REC-xml/#NT-PI
http://www.w3.org/TR/REC-xml/#NT-XMLDecl
I'm pretty sure these are used only for general data streams, such as
files and resources, to help the consumer identify and process the data
as XML. In the case of SOAP, I think the fact that the data is XML is
implied. I've never seen SOAP messages use XML PIs the way you suggest.
There must be some configuration to set this.
Thanks,
-Chris W.
-----Original Message-----
From: Benson Margulies [mailto:bimargulies@gmail.com]
Sent: Tuesday, August 19, 2008 10:16 PM
To: users@cxf.apache.org
Subject: Re: Dealing with non-UTF-8 messages
This is not a current feature of CXF. If the messages had prologs with
encodings, all would work.
Better, put an interceptor at the front to transcode to UTF-8 or add
prologs?
Can you tell us if this problem is specific to some app of yours or more
generic, to help motivate (or not) effort?
Re: Dealing with non-UTF-8 messages
Posted by Freeman Fang <fr...@gmail.com>.
Dan is correct. I just verified that MTOM currently can't work with the
JMS transport because of the missing Content-Type header. I filed JIRA
issue [1] to track it.
[1] https://issues.apache.org/jira/browse/CXF-1760
Regards
Freeman
Daniel Kulp wrote:
> Christian,
>
> We can put a "Content-Type" header there if needed. It really should go
> there, especially for byte[] messages. It's kind of important as we'd need
> to know if the byte[] is a MTOM byte[] or not and things like that.
>
> (note: Freeman and I were just talking a little about this on
> dev@cxf.apache.org related to a commit he just made to the JMS conduit)
>
> Dan
Re: Dealing with non-UTF-8 messages
Posted by Daniel Kulp <dk...@apache.org>.
Christian,
We can put a "Content-Type" header there if needed. It really should go
there, especially for byte[] messages. It's kind of important as we'd need
to know if the byte[] is a MTOM byte[] or not and things like that.
(note: Freeman and I were just talking a little about this on
dev@cxf.apache.org related to a commit he just made to the JMS conduit)
Dan
On Wednesday 20 August 2008 11:46:15 am Christian Schneider wrote:
> How should the encoding issue be handled for JMS? As far as I know there
> is no content type header in JMS.
>
> https://issues.apache.org/jira/browse/CXF-1668
>
> Best regards
>
> Christian
--
Daniel Kulp
dkulp@apache.org
http://www.dankulp.com/blog
RE: Dealing with non-UTF-8 messages
Posted by "Wolf, Chris (IT)" <Ch...@morganstanley.com>.
I'll file when I verify for sure with a programmatic client.
Thanks...
-Chris W.
-----Original Message-----
From: Daniel Kulp [mailto:dkulp@apache.org]
Sent: Tuesday, August 26, 2008 3:18 PM
To: users@cxf.apache.org
Cc: Wolf, Chris (IT)
Subject: Re: Dealing with non-UTF-8 messages
Hmm... this is actually starting to sound like a bug. If the charset
isn't specified, we should probably default it to ISO-8859-1, as the
spec says. That would definitely be for the HTTP transport only.
Could you log a bug? Maybe with a patch. :-)
Dan
Re: Dealing with non-UTF-8 messages
Posted by Daniel Kulp <dk...@apache.org>.
Hmm... this is actually starting to sound like a bug. If the charset
isn't specified, we should probably default it to ISO-8859-1, as the
spec says. That would definitely be for the HTTP transport only.
Could you log a bug? Maybe with a patch. :-)
Dan
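A minimal sketch of the defaulting behaviour Dan describes above; it is not CXF's actual code. It pulls the charset parameter out of a Content-Type value and falls back to ISO-8859-1 when none is given, which is the HTTP/1.1 default referred to in the thread. The class and method names (ContentTypeCharset, charsetOf) are hypothetical, chosen only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of CXF.
final class ContentTypeCharset {

    /**
     * Returns the charset named in a Content-Type header value, or
     * ISO-8859-1 when the header is absent, has no charset parameter,
     * or names a charset this JVM does not know.
     */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            String[] parts = contentType.split(";");
            // parts[0] is the media type; the parameters follow it.
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = param.substring("charset=".length()).trim();
                    // The parameter value may be a quoted string.
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported name: fall through to the default.
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }
}
```

With this helper, the form post described below (Content-type: application/x-www-form-urlencoded, no charset) would be read as ISO-8859-1 rather than UTF-8.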
On Monday 25 August 2008 4:56:19 pm Wolf, Chris (IT) wrote:
> We're using a JSP page with a declaration like:
>
> <%@ page language="java" contentType="text/html; charset=ISO-8859-1"%>
>
> ...and a textarea within a form, containing a template request document.
>
> However, even after changing the browser's (FF-2, or IE) default
> character encoding to iso-8859-1 AND changing the FORM tag like:
>
> <form action="./index.jsp" method="post" name="theForm"
> enctype="application/x-www-form-urlencoded; charset=ISO-8859-1"
> accept-charset="ISO-8859-1">
> </form>
>
> A sniffer shows that when the browser first loads the page (GET), it
> sends,
> "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7"
>
> ...but when the form is posted, it only sends:
>
> "Content-type: application/x-www-form-urlencoded"
>
> (note, it's without the "; charset=iso-8859-1" sub-value)
>
> On the other hand, this w3.org document states that HTTP-1.1 implies a
> default encoding of iso-8859-1, rather than utf-8, so even the above
> mentioned steps should not have been necessary, correct?
> http://www.w3.org/International/O-HTTP-charset
>
>
> Does this mean there's no way to test an iso-8859-1 request
> document containing accented characters using an HTML FORM POST?
> i.e. We can only use a programmatic client?
>
>
> Thanks,
>
> -Chris W.
--
Daniel Kulp
dkulp@apache.org
http://www.dankulp.com/blog
Re: Dealing with non-UTF-8 messages
Posted by Christian Schneider <ch...@die-schneider.net>.
How should the encoding issue be handled for JMS? As far as I know there
is no content type header in JMS.
https://issues.apache.org/jira/browse/CXF-1668
Best regards
Christian
RE: Dealing with non-UTF-8 messages
Posted by "Wolf, Chris (IT)" <Ch...@morganstanley.com>.
We're using a JSP page with a declaration like:
<%@ page language="java" contentType="text/html; charset=ISO-8859-1"%>
...and a textarea within a form, containing a template request document.
However, even after changing the browser's (FF-2, or IE) default
character encoding to iso-8859-1 AND changing the FORM tag like:
<form action="./index.jsp" method="post" name="theForm"
enctype="application/x-www-form-urlencoded; charset=ISO-8859-1"
accept-charset="ISO-8859-1">
</form>
A sniffer shows that when the browser first loads the page (GET), it
sends,
"Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7"
...but when the form is posted, it only sends:
"Content-type: application/x-www-form-urlencoded"
(note, it's without the "; charset=iso-8859-1" sub-value)
On the other hand, this w3.org document states that HTTP-1.1 implies a
default encoding of iso-8859-1, rather than utf-8, so even the above
mentioned steps should not have been necessary, correct?
http://www.w3.org/International/O-HTTP-charset
Does this mean there's no way to test an iso-8859-1 request
document containing accented characters using an HTML FORM POST?
i.e. We can only use a programmatic client?
Thanks,
-Chris W.
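For the programmatic-client route asked about above, a sketch along the following lines may help. It uses only the JDK's HttpURLConnection and states the charset explicitly in the Content-Type header, which the browser form post did not do. The class name, the endpoint URL in the usage note, and the choice of text/xml as the media type are assumptions for illustration; a real SOAP 1.1 service may also expect a SOAPAction header, which is omitted here.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.Charset;

// Hypothetical test client, not part of CXF.
final class Latin1SoapClient {

    /** Opens (but does not yet connect) a POST connection that declares the charset. */
    static HttpURLConnection prepare(String endpoint, Charset charset) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // The charset parameter here must match the bytes written in post().
        conn.setRequestProperty("Content-Type", "text/xml; charset=" + charset.name());
        return conn;
    }

    /** Sends the XML encoded in the given charset and returns the HTTP status code. */
    static int post(String endpoint, String xml, Charset charset) throws IOException {
        HttpURLConnection conn = prepare(endpoint, charset);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(xml.getBytes(charset));
        }
        return conn.getResponseCode();
    }
}
```

A call such as Latin1SoapClient.post("http://localhost:8080/services/Example", requestXml, StandardCharsets.ISO_8859_1) would then send the accented characters as single ISO-8859-1 bytes together with a matching header.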
Re: Dealing with non-UTF-8 messages
Posted by Daniel Kulp <dk...@apache.org>.
Chris,
On Wednesday 20 August 2008 4:50:52 am Wolf, Chris (IT) wrote:
> From looking at the SOAP spec, it seems that it's the responsibility
> of the transport to indicate the encoding, as these samples show:
>
> http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383490
>
> Note the "Content-Type:" HTTP header.
>
> Also note that the SOAP spec explicitly states that SOAP messages MUST
> NOT
> (their emphasis) contain processing instructions (e.g. the "<?xml...?>"
> declaration)
>
> http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383492
>
>
> I wonder whether changing our client to set the outbound Content-Type
> header to indicate the encoding will fix it. I will look into this
> angle.
Yes, that is the proper fix. The charset in the Content-Type header is what
is used to create the parser. Thus, you need to make sure the client sets
the proper charset there.
Dan
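To illustrate the point that the charset from the Content-Type header is what the parser gets built with, here is a small stand-alone sketch using the JDK's StAX API. It is not CXF's internal code: the class and method names are hypothetical, and the charset is simply passed in as an argument where a server would take it from the request header.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration, not CXF's implementation.
final class CharsetAwareParser {

    /**
     * Decodes the raw message bytes with the charset announced by the
     * transport, then returns the text of the first element.
     */
    static String firstElementText(byte[] body, Charset charset) throws XMLStreamException {
        // Decoding happens here, before the XML parser sees any characters.
        Reader chars = new InputStreamReader(new ByteArrayInputStream(body), charset);
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(chars);
        try {
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.START_ELEMENT) {
                    return xml.getElementText();
                }
            }
            return null; // no element found
        } finally {
            xml.close();
        }
    }
}
```

If the sender writes ISO-8859-1 bytes but the header says (or defaults to) UTF-8, the decoding step is where the mismatch shows up, which is why the client has to set the charset correctly.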
--
Daniel Kulp
dkulp@apache.org
http://www.dankulp.com/blog
Re: Dealing with non-UTF-8 messages
Posted by Ian Roberts <i....@dcs.shef.ac.uk>.
Wolf, Chris (IT) wrote:
> From looking at the SOAP spec, it seems that it's the responsibility
> of the transport to indicate the encoding, as these samples show:
>
> http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383490
>
> Note the "Content-Type:" HTTP header.
>
> Also note that the SOAP spec explicitly states that SOAP messages MUST
> NOT
> (their emphasis) contain processing instructions (e.g. the "<?xml...?>"
> declaration)
>
> http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383492
Technically, <?xml ...?> isn't a processing instruction. It looks like
one, but it's actually a different kind of object called the "XML
declaration" and it is only allowed to appear at the very top of the
document, whereas true processing instructions (<?anything-except-xml
...?>) can appear anywhere.
http://www.w3.org/TR/REC-xml/#sec-prolog-dtd
http://www.w3.org/TR/REC-xml/#sec-pi
However, the parser is allowed to ignore the encoding specified in the
XML declaration if it has some external way to know what the encoding is
(such as the Content-type header).
http://www.w3.org/TR/REC-xml/#charencoding
But if the request had a content type like application/xml then it would
be the XML declaration that determines the encoding used.
Ian
--
Ian Roberts | Department of Computer Science
i.roberts@dcs.shef.ac.uk | University of Sheffield, UK
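The last case Ian mentions, where no external encoding information is available and the XML declaration decides, can be seen with the JDK's own DOM parser. This is only a sketch with hypothetical names (DeclaredEncodingDemo, rootText); the parser is handed a raw byte stream and no charset, so it has to rely on the declaration at the top of the document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical illustration of encoding detection from the XML declaration.
final class DeclaredEncodingDemo {

    /** Parses raw bytes with no external charset hint and returns the root element's text. */
    static String rootText(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>Jos\u00e9</name>";
        // Encode as ISO-8859-1 so the bytes match the declaration.
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(bytes));
    }
}
```

Without the declaration, the same bytes would be treated as UTF-8 by default, which is the failure the thread started with.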
RE: Dealing with non-UTF-8 messages
Posted by "Wolf, Chris (IT)" <Ch...@morganstanley.com>.
From looking at the SOAP spec, it seems that it's the responsibility
of the transport to indicate the encoding, as these samples show:
http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383490
Note the "Content-Type:" HTTP header.
Also note that the SOAP spec explicitly states that SOAP messages MUST
NOT (their emphasis) contain processing instructions (e.g. the
"<?xml...?>" declaration)
http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383492
I wonder whether changing our client to set the outbound Content-Type
header to indicate the encoding will fix it. I will look into this
angle.
Thanks,
-Chris W.
-----Original Message-----
From: Benson Margulies [mailto:bimargulies@gmail.com]
Sent: Tuesday, August 19, 2008 10:16 PM
To: users@cxf.apache.org
Subject: Re: Dealing with non-UTF-8 messages
This is not a current feature of CXF. If the messages had prologs with
encodings, all would work.
Better, put an interceptor at the front to transcode to UTF-8 or add
prologs?
Can you tell us if this problem is specific to some app of yours or more
generic, to help motivate (or not) effort?
Re: Dealing with non-UTF-8 messages
Posted by Benson Margulies <bi...@gmail.com>.
This is not a current feature of CXF. If the messages had prologs with
encodings, all would work.
Better, put an interceptor at the front to transcode to UTF-8 or add prologs?
Can you tell us if this problem is specific to some app of yours or
more generic, to help motivate (or not) effort?
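The transcoding idea can be sketched in plain Java, independent of CXF's interceptor API (wiring this into an actual interceptor is left out, and the class and method names are hypothetical). The first method shows why ISO-8859-1 bytes trip a strict UTF-8 decoder; the second re-encodes them as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a CXF interceptor.
final class Latin1Transcoder {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Re-encodes ISO-8859-1 bytes as UTF-8. */
    static byte[] latin1ToUtf8(byte[] latin1) {
        // Every byte value is a valid ISO-8859-1 character, so this decode never fails.
        return new String(latin1, StandardCharsets.ISO_8859_1)
            .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'é' is the single byte 0xE9 in ISO-8859-1, which is not valid UTF-8 here.
        byte[] latin1 = "<name>Jos\u00e9</name>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(latin1));               // prints "false"
        System.out.println(isValidUtf8(latin1ToUtf8(latin1))); // prints "true"
    }
}
```

Note that this only works when the sender's encoding is known in advance; transcoding bytes that are already UTF-8 would corrupt them.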