
Posted to users@cxf.apache.org by "Wolf, Chris (IT)" <Ch...@morganstanley.com> on 2008/08/19 22:47:36 UTC

Dealing with non-UTF-8 messages

We now have some messages that contain ISO-8859-1 (non-ASCII) characters,
e.g. accented characters used in Spanish and/or French, which apparently
are not valid UTF-8.  These are causing exceptions citing invalid UTF-8
character sequences.
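
To illustrate why such messages trip a UTF-8 parser, here is a minimal sketch
using only the JDK (it is not CXF code, and the class and method names are
made up for the example).  An accented character such as "é" is the single
byte 0xE9 in ISO-8859-1, and that byte on its own is not a complete UTF-8
sequence, so a strict UTF-8 decoder rejects it:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Latin1VersusUtf8 {
    // Strictly decodes the bytes as UTF-8 and reports whether that succeeded.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café"
        // ISO-8859-1 encodes é as the single byte 0xE9.
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.ISO_8859_1))); // false
        // UTF-8 encodes é as the two bytes 0xC3 0xA9.
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8)));      // true
    }
}
```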

I was able to eliminate one source by explicitly setting the encoding to
"iso-8859-1" in the parser of one of our interceptors, but now the error
simply migrated to one of the CXF built-in interceptors.  Is there a way
to globally set the encoding?

I searched the FAQs and list archives and found nothing helpful.

Thanks,

   -Chris W.
--------------------------------------------------------

NOTICE: If received in error, please destroy and notify sender. Sender does not intend to waive confidentiality or privilege. Use of this email is prohibited when received in error.

RE: Dealing with non-UTF-8 messages

Posted by "Wolf, Chris (IT)" <Ch...@morganstanley.com>.

By "prolog", I assume you mean the XML processing instruction ("xml pi").
http://www.w3.org/TR/REC-xml/#NT-PI
http://www.w3.org/TR/REC-xml/#NT-XMLDecl

I'm pretty sure these are used only for general data streams, such as
files and resources, to help the consumer identify and process the data
as XML.  In the case of SOAP, I think the fact that the data is XML is
implied - I've never seen SOAP messages use xml-PIs the way you suggest.

There must be some configuration to set this.

Thanks,

   -Chris W.

-----Original Message-----
From: Benson Margulies [mailto:bimargulies@gmail.com] 
Sent: Tuesday, August 19, 2008 10:16 PM
To: users@cxf.apache.org
Subject: Re: Dealing with non-UTF-8 messages

This is not a current feature of CXF. If the messages had prologs with
encodings, all would work.

Better, put an interceptor at the front to transcode to UTF-8 or add
prologs?

Can you tell us if this problem is specific to some app of yours or more
generic, to help motivate (or not) effort?


Re: Dealing with non-UTF-8 messages

Posted by Freeman Fang <fr...@gmail.com>.

Dan is correct. I just verified that MTOM currently can't work with the JMS
transport because of the missing Content-Type header. I filed JIRA [1] to
track it.

[1] https://issues.apache.org/jira/browse/CXF-1760

Regards
Freeman

Daniel Kulp wrote:
> Christian,
>
> We can put a "Content-Type" header there if needed.   It really should go 
> there, especially for byte[] messages.    It's kind of important as we'd need 
> to know if the byte[] is a MTOM byte[] or not and things like that.
>
> (note: Freeman and I were just talking a little about this on 
> dev@cxf.apache.org related to a commit he just made to the JMS conduit)
>
> Dan

Re: Dealing with non-UTF-8 messages

Posted by Daniel Kulp <dk...@apache.org>.

Christian,

We can put a "Content-Type" header there if needed.   It really should go 
there, especially for byte[] messages.    It's kind of important as we'd need 
to know if the byte[] is a MTOM byte[] or not and things like that.

(note: Freeman and I were just talking a little about this on 
dev@cxf.apache.org related to a commit he just made to the JMS conduit)

Dan



On Wednesday 20 August 2008 11:46:15 am Christian Schneider wrote:
> How should the encoding issue be handled for JMS? As far as I know there
> is no content type header in JMS.
>
> https://issues.apache.org/jira/browse/CXF-1668
>
> Best regards
>
> Christian



-- 
Daniel Kulp
dkulp@apache.org
http://www.dankulp.com/blog

RE: Dealing with non-UTF-8 messages

Posted by "Wolf, Chris (IT)" <Ch...@morganstanley.com>.

I'll file a bug once I've verified this for sure with a programmatic client.
Thanks...

   -Chris W.

-----Original Message-----
From: Daniel Kulp [mailto:dkulp@apache.org] 
Sent: Tuesday, August 26, 2008 3:18 PM
To: users@cxf.apache.org
Cc: Wolf, Chris (IT)
Subject: Re: Dealing with non-UTF-8 messages



Hmm.....   this actually is starting to sound like a bug.   If the charset
isn't specified, we probably should default it in as 8859-1 like the spec
says.   That would definitely be for the http transport only.

Could you log a bug?  Maybe with a patch.  :-)

Dan




Re: Dealing with non-UTF-8 messages

Posted by Daniel Kulp <dk...@apache.org>.


Hmm.....   this actually is starting to sound like a bug.   If the charset 
isn't specified, we probably should default it in as 8859-1 like the spec 
says.   That would definitely be for the http transport only.   
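
A minimal sketch of that fallback, using only the JDK: read the charset
parameter from a Content-Type value and default to ISO-8859-1 when it is
absent.  This is not the CXF implementation; the class and method names are
hypothetical, and the parsing is deliberately simple.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {
    // Returns the charset named in a Content-Type value such as
    // "text/xml; charset=utf-8", or ISO-8859-1 when no charset is given.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            String[] parts = contentType.split(";");
            // parts[0] is the media type; the rest are name=value parameters.
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                int eq = param.indexOf('=');
                if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                    String value = param.substring(eq + 1).trim();
                    // The value may be a quoted string: charset="utf-8".
                    if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                        value = value.substring(1, value.length() - 1);
                    }
                    return Charset.forName(value);
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/xml; charset=utf-8"));
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```

Note that Charset.forName throws for an unknown or malformed charset name; a
real transport would have to decide how to handle that case.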

Could you log a bug?  Maybe with a patch.  :-)

Dan



On Monday 25 August 2008 4:56:19 pm Wolf, Chris (IT) wrote:
> We're using a JSP page with a declaration like:
>
> <%@ page language="java" contentType="text/html; charset=ISO-8859-1"%>
>
> ...and textarea within a form, containing a template request document.
>
> However, even after changing the browser's (FF-2, or IE) default
> character encoding to iso-8859-1 AND changing the FORM tag like:
>
> <form action="./index.jsp" method="post" name="theForm"
>       enctype="application/x-www-form-urlencoded; charset=ISO-8859-1"
>       accept-charset="ISO-8859-1">
> </form>
>
> A sniffer shows that when the browser first loads the page (GET), it
> sends,
> "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7"
>
> ...but when the form is posted, it only sends:
>
> "Content-type: application/x-www-form-urlencoded"
>
> (note, it's without the "; charset=iso-8859-1" sub-value)
>
> On the other hand, this w3.org document states that HTTP-1.1 implies a
> default encoding of iso-8859-1, rather than utf-8, so even the above
> mentioned steps should not have been necessary, correct?
> http://www.w3.org/International/O-HTTP-charset
>
>
> Does this mean there's no way to test an iso-8859-1 request
> document containing accented characters using an HTML FORM POST?
> i.e. We can only use a programmatic client?
>
>
> Thanks,
>
>    -Chris W.



-- 
Daniel Kulp
dkulp@apache.org
http://www.dankulp.com/blog

Re: Dealing with non-UTF-8 messages

Posted by Christian Schneider <ch...@die-schneider.net>.

How should the encoding issue be handled for JMS? As far as I know there 
is no content type header in JMS.

https://issues.apache.org/jira/browse/CXF-1668

Best regards

Christian


Daniel Kulp schrieb:
> Chris,
>
>
> On Wednesday 20 August 2008 4:50:52 am Wolf, Chris (IT) wrote:
>   
>> From looking at the SOAP spec, it seems that it's the responsibility
>> of the transport to indicate the encoding, as these samples show:
>>
>> http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383490
>>
>> Note the "Content-Type:" HTTP header.
>>
>> Also note that the SOAP spec explicitly states that SOAP messages MUST
>> NOT (their emphasis) contain processing instructions (e.g. the
>> "<?xml...?>" declaration)
>>
>> http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383492
>>
>>
>> I wonder if we change our client to set the outbound content-type header
>> to indicate the encoding, if this will fix it.  I will look into this
>> angle.
>>     
>
> Yes, that is the proper fix.   The charset in the Content-Type header is what 
> is used to create the parser.   Thus, you need to make sure the client sets 
> the proper charset there.
>
> Dan
>
>
>

RE: Dealing with non-UTF-8 messages

Posted by "Wolf, Chris (IT)" <Ch...@morganstanley.com>.

We're using a JSP page with a declaration like:

<%@ page language="java" contentType="text/html; charset=ISO-8859-1"%>

...and textarea within a form, containing a template request document.

However, even after changing the browser's (FF-2, or IE) default 
character encoding to iso-8859-1 AND changing the FORM tag like:

<form action="./index.jsp" method="post" name="theForm" 
      enctype="application/x-www-form-urlencoded; charset=ISO-8859-1" 
      accept-charset="ISO-8859-1">
</form>

A sniffer shows that when the browser first loads the page (GET), it sends:
"Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7"

...but when the form is posted, it only sends:

"Content-type: application/x-www-form-urlencoded"

(note, it's without the "; charset=iso-8859-1" sub-value)

On the other hand, this w3.org document states that HTTP-1.1 implies a
default encoding of iso-8859-1, rather than utf-8, so even the above
mentioned steps should not have been necessary, correct?
http://www.w3.org/International/O-HTTP-charset


Does this mean there's no way to test an iso-8859-1 request 
document containing accented characters using an HTML FORM POST?  
i.e. We can only use a programmatic client?  


Thanks,

   -Chris W.

-----Original Message-----
From: Daniel Kulp [mailto:dkulp@apache.org] 
Sent: Wednesday, August 20, 2008 11:23 AM
To: users@cxf.apache.org
Cc: Wolf, Chris (IT)
Subject: Re: Dealing with non-UTF-8 messages


Chris,


On Wednesday 20 August 2008 4:50:52 am Wolf, Chris (IT) wrote:
> From looking at the SOAP spec, it seems that it's the responsibility of
> the transport to indicate the encoding, as these samples show:
>
> http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383490
>
> Note the "Content-Type:" HTTP header.
>
> Also note that the SOAP spec explicitly states that SOAP messages MUST
> NOT (their emphasis) contain processing instructions (e.g. the
> "<?xml...?>" declaration)
>
> http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383492
>
>
> I wonder if we change our client to set the outbound content-type
> header to indicate the encoding, if this will fix it.  I will look
> into this angle.

Yes, that is the proper fix.   The charset in the Content-Type header is what
is used to create the parser.   Thus, you need to make sure the client sets
the proper charset there.

Dan



Re: Dealing with non-UTF-8 messages

Posted by Daniel Kulp <dk...@apache.org>.

Chris,


On Wednesday 20 August 2008 4:50:52 am Wolf, Chris (IT) wrote:
> From looking at the SOAP spec, it seems that it's the responsibility
> of the transport to indicate the encoding, as these samples show:
>
> http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383490
>
> Note the "Content-Type:" HTTP header.
>
> Also note that the SOAP spec explicitly states that SOAP messages MUST
> NOT (their emphasis) contain processing instructions (e.g. the
> "<?xml...?>" declaration)
>
> http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383492
>
>
> I wonder if we change our client to set the outbound content-type header
> to indicate the encoding, if this will fix it.  I will look into this
> angle.

Yes, that is the proper fix.   The charset in the Content-Type header is what 
is used to create the parser.   Thus, you need to make sure the client sets 
the proper charset there.
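
For a hand-written HTTP client, a rough sketch of "setting the proper charset"
could look like the following.  It uses only the JDK's HttpURLConnection, not
the CXF client API; the class name and the endpoint passed to post() are
placeholders, and a SOAP 1.1 call would normally also need a SOAPAction
header, which is omitted here.  The point is that the body bytes and the
charset parameter must agree:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetAwarePost {
    // Builds the header value, e.g. "text/xml; charset=ISO-8859-1".
    static String contentType(Charset charset) {
        return "text/xml; charset=" + charset.name();
    }

    // Posts the XML encoded in the given charset and declares that same
    // charset in the Content-Type header. Returns the HTTP status code.
    static int post(URL endpoint, String soapXml, Charset charset) throws IOException {
        byte[] body = soapXml.getBytes(charset);
        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", contentType(charset));
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) {
        // Only shows the header; call post(...) with a real endpoint to send.
        System.out.println("Content-Type: " + contentType(StandardCharsets.ISO_8859_1));
    }
}
```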

Dan





-- 
Daniel Kulp
dkulp@apache.org
http://www.dankulp.com/blog

Re: Dealing with non-UTF-8 messages

Posted by Ian Roberts <i....@dcs.shef.ac.uk>.

Wolf, Chris (IT) wrote:
> From looking at the SOAP spec, it seems that it's the responsibility
> of the transport to indicate the encoding, as these samples show:
>
> http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383490
>
> Note the "Content-Type:" HTTP header.
>
> Also note that the SOAP spec explicitly states that SOAP messages MUST
> NOT (their emphasis) contain processing instructions (e.g. the
> "<?xml...?>" declaration)
> 
> http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383492

Technically, <?xml ...?> isn't a processing instruction.  It looks like
one, but it's actually a different kind of object called the "XML
declaration" and it is only allowed to appear at the very top of the
document, whereas true processing instructions (<?anything-except-xml
...?>) can appear anywhere.

http://www.w3.org/TR/REC-xml/#sec-prolog-dtd
http://www.w3.org/TR/REC-xml/#sec-pi

However, the parser is allowed to ignore the encoding specified in the
XML declaration if it has some external way to know what the encoding is
(such as the Content-type header).

http://www.w3.org/TR/REC-xml/#charencoding

But if the request had a content type like application/xml then it would
be the XML declaration that determines the encoding used.
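
A small sketch of both situations with the JDK's built-in DOM parser (the
class and method names are invented for the example).  Given raw bytes, the
parser picks the encoding from the XML declaration; given a Reader, the
caller has already decoded the bytes using external knowledge (such as a
Content-Type charset), and the declaration is not needed:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class XmlEncodingSources {
    // Byte-stream input: the parser determines the encoding itself, using
    // the XML declaration if there is one (UTF-8 is assumed otherwise).
    static String parseBytes(byte[] xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(xml))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Character-stream input: the caller decodes with an externally known
    // charset, so the parser never has to guess.
    static String parseWithExternalCharset(byte[] xml, Charset charset) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), charset);
            return builder.parse(new InputSource(reader))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        byte[] declared = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>Jos\u00e9</name>"
                .getBytes(latin1);
        byte[] bare = "<name>Jos\u00e9</name>".getBytes(latin1);
        System.out.println(parseBytes(declared));
        System.out.println(parseWithExternalCharset(bare, latin1));
    }
}
```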

Ian

-- 
Ian Roberts               | Department of Computer Science
i.roberts@dcs.shef.ac.uk  | University of Sheffield, UK

RE: Dealing with non-UTF-8 messages

Posted by "Wolf, Chris (IT)" <Ch...@morganstanley.com>.

From looking at the SOAP spec, it seems that it's the responsibility
of the transport to indicate the encoding, as these samples show:

http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383490

Note the "Content-Type:" HTTP header.

Also note that the SOAP spec explicitly states that SOAP messages MUST
NOT (their emphasis) contain processing instructions (e.g. the
"<?xml...?>" declaration)

http://www.w3.org/TR/2000/NOTE-SOAP-20000508/#_Toc478383492

I wonder if we change our client to set the outbound content-type header
to indicate the encoding, if this will fix it.  I will look into this
angle.

Thanks,

   -Chris W.

-----Original Message-----
From: Benson Margulies [mailto:bimargulies@gmail.com] 
Sent: Tuesday, August 19, 2008 10:16 PM
To: users@cxf.apache.org
Subject: Re: Dealing with non-UTF-8 messages

This is not a current feature of CXF. If the messages had prologs with
encodings, all would work.

Better, put an interceptor at the front to transcode to UTF-8 or add
prologs?

Can you tell us if this problem is specific to some app of yours or more
generic, to help motivate (or not) effort?


Re: Dealing with non-UTF-8 messages

Posted by Benson Margulies <bi...@gmail.com>.

This is not a current feature of CXF. If the messages had prologs with
encodings, all would work.

Better, put an interceptor at the front to transcode to UTF-8 or add prologs?
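
The transcoding step itself is small.  Below is a sketch of just that step
with the JDK; the class name is hypothetical and the wiring into a CXF
interceptor is deliberately left out.  It assumes the source encoding is
known to be ISO-8859-1 and that the message carries no XML declaration naming
a different encoding (one would have to be rewritten or removed as well):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeToUtf8 {
    // Decodes the raw message bytes with the known source charset and
    // re-encodes them as UTF-8 for a parser that expects UTF-8.
    static byte[] transcode(byte[] body, Charset source) {
        return new String(body, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = "<name>Jos\u00e9</name>".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1);
        // é takes one byte in ISO-8859-1 and two in UTF-8.
        System.out.println(latin1.length + " -> " + utf8.length); // 17 -> 18
    }
}
```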

Can you tell us if this problem is specific to some app of yours or
more generic, to help motivate (or not) effort?

On Tue, Aug 19, 2008 at 4:47 PM, Wolf, Chris (IT)
<Ch...@morganstanley.com> wrote:
> We now have some messages that contain iso-8859-1 (but non-ascii)
> characters,
> which apparently, are not UTF-8, e.g. accented characters used in
> Spanish and/or
> French.  These are causing exceptions citing invalid UTF-8 character
> sequences.
>
> I was able to eliminate one source by explicitly setting the encoding to
> "iso-8859-1" in the parser of one of our interceptors, but now the error
> simply migrated to one of the CXF built-in interceptors.  Is there a way
> to globally set the encoding?
>
> I searched the FAQs and list archives and found nothing helpful.
>
> Thanks,
>
>   -Chris W.