Posted to slide-dev@jakarta.apache.org by Sung-Gu <je...@apache.org> on 2002/03/21 19:18:19 UTC

Re: [HttpClient]Encoding

I assume you are talking about the character set (what MIME calls the character encoding) in HTTP.  I have added my comments below.  ;)


Sung-Gu

----- Original Message ----- 
Subject: Re: [HttpClient]Encoding


> I'll see about changing getResponseBodyAsString() to use the charset from
> the content-type (if it exists).  I'm up to my ears with day job work right
> now, so it'll probably be a while before I can get to it.

At some point I think we'll also need to support language tags (in the Accept-Language and Content-Language header fields) and Internet media types (in the Accept and Content-Type header fields).

> 
> People still need to understand (and I'll improve the JavaDoc) that
> getResponseBodyAsString() is never really going to be all that useful in the
> real world.  From HttpClient's perspective the response body is simply a
> sequence of bytes, nothing more.  It is up to a higher application layer to
> actually *interpret* those bytes based on the mime type specified in the
> content-type header.
> 
> Marc Saegesser 
> 
> > -----Original Message-----
> > From: Rapheal Kaplan [mailto:rafe@mimir.net]
> > Sent: Wednesday, March 20, 2002 1:53 PM
> > To: Jakarta Commons Developers List
> > Subject: Re: [HttpClient]Encoding
> > 
> > 
> >   Makes sense to me.  Because the encoding is handled in the body
> > itself, it doesn't necessarily help that much to set the encoding in
> > the getResponseBodyAsString method.  Also, this kind of means that
> > you can't rely on the getResponseBodyAsString method for all
> > purposes.  There needs to be some other layer of a client
> > application that manages encoding.
> >
> >   I still see the use of get...AsString, of course.  It could be an
> > in-between step that is sent to a parser to determine the actual
> > encoding, but then you would need to return to the original byte
> > stream anyway to re-string the body.  Maybe the documentation should
> > reflect this information.
> >
> >   Also, if people start using charset info in the future, it would
> > probably be nice to provide support.  It might be that doing body to
> > string conversion should be somewhere else in the API.  Any ideas?
> >
> >   My first guess would be to have a utility class that can do the
> > correct encoding, from both the header and maybe even parsing the
> > content.  However, I don't think I am familiar enough with the API
> > to say decisively.
> >
> >   I do know that such features might be very useful for some work
> > that I need to do in the near future.  I am working on software that
> > needs to interact with several languages with non-Latin character
> > sets.

In your previous mail, you wrote:
> For example, if the client is requesting a document written in
> Chinese, it could well use an entirely different encoding.

If you want to solve this problem purely from the perspective of character encoding,
you should consider the conversion between the local character set and the transfer character set on both the client and the server side.

It can get more complicated!
If you mix non-ASCII scripts (Korean and Chinese...), you also have to handle bi-directional display for these character sets.  Converting between the local character set and UTF-8 then becomes a two-step process: first convert the local character set to UCS, then convert UCS to UTF-8.  Complicated, huh?
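
The two-step conversion described above can be sketched in Java, where a String already holds Unicode text internally.  This is only a minimal illustration, not HttpClient code: the class name TranscodeSketch and the method toUtf8 are hypothetical, and the local character set name is whatever the caller supplies.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpClient.
final class TranscodeSketch {

    /** Re-encodes bytes from a local character set into UTF-8. */
    static byte[] toUtf8(byte[] localBytes, String localCharsetName) {
        // Throws an IllegalArgumentException subclass if the name is
        // malformed or the runtime does not support the charset.
        Charset local = Charset.forName(localCharsetName);

        // Step 1: local character set -> Unicode (a Java String).
        String unicode = new String(localBytes, local);

        // Step 2: Unicode -> UTF-8.
        return unicode.getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, toUtf8(bytes, "EUC-KR") would re-encode Korean text, assuming the runtime ships that charset; only a handful of charsets such as ISO-8859-1 and UTF-8 are guaranteed to be present.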

And one more thing!
Some old clients or servers don't support an 8-bit transfer encoding like UTF-8.  Then what?  We would have to check whether the data is valid UTF-8 or not.
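
One possible way to run such a validity check in Java is a strict decoder that reports malformed input instead of silently replacing it.  The sketch below is an assumption about how an application could do this, not something HttpClient provides; Utf8Check and isValidUtf8 are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpClient.
final class Utf8Check {

    /** Returns true only if the whole array is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder fail on bad input instead of
        // substituting a replacement character.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Note that pure ASCII and many short byte sequences also pass this check, so a positive result only says the bytes could be UTF-8, not that the sender intended it.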


However, there is an easier way to solve this problem.
(This is the part I really WANTED to say!  ^^)
That is to use an "escaped encoding" that uses only the ASCII character set.
It looks like the application/x-www-form-urlencoded media type used for HTML forms,
but it's somewhat different.
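
To give a feel for what an ASCII-only escaped form looks like, here is a small sketch using the JDK's java.net.URLEncoder and URLDecoder.  These classes implement application/x-www-form-urlencoded itself, so they are only a stand-in for the "somewhat different" escaped encoding mentioned above, whose exact rules are not spelled out in this mail.  The class name EscapeSketch and the choice of UTF-8 as the underlying byte encoding are assumptions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper, not part of HttpClient.
final class EscapeSketch {

    /** Escapes text into ASCII-only %XX form (UTF-8 bytes underneath). */
    static String escape(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java runtime.
            throw new IllegalStateException(e);
        }
    }

    /** Reverses escape(). */
    static String unescape(String escaped) {
        try {
            return URLDecoder.decode(escaped, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this sketch, the two Korean characters "\uD55C\uAE00" become "%ED%95%9C%EA%B8%80", which survives any 7-bit transport and can be turned back into the original text with unescape().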

> > 
> >   - Rapheal Kaplan
> > 
> > 
> > 
> > On Wednesday 20 March 2002 14:27, you wrote:
> > > I've had to deal with this problem myself.  Right now the only
> > > solution is to use getResponseBody() and convert bytes into a string
> > > using the appropriate encoding.  I like the idea of having
> > > getResponseBodyAsString() use the encoding specified in the
> > > Content-Type header, but the problem is that it still won't be very
> > > useful.
> > >
> > > The vast majority of web servers out there don't include a
> > > "; charset=" attribute in the content-type header or provide a
> > > reasonable mechanism for content authors to cause the server to set
> > > the attribute correctly on a per-file basis.  Most pages with
> > > non-ISO-LATIN-1 charsets use a <META HTTP-EQUIV> tag in the HTML
> > > header to specify the page encoding.  That means you still have to
> > > read at least part of the response body (as ISO-LATIN-1) in order to
> > > determine the correct encoding.
> > >
> > > I don't have a problem with changing getResponseBodyAsString() to
> > > check the content-type header, I just doubt that doing that will
> > > make it much more useful in the real world.
> > >
> > > What do others think?
> > >
> > > Marc Saegesser
> > >
> > 
> 
> 
> 
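
The approach Marc describes in the quoted mail, reading the start of the body as ISO-LATIN-1 to find the charset declared in a META tag, might look roughly like the sketch below.  It is a deliberately naive illustration: MetaCharsetSniffer, sniff, and the regular expression are all assumptions, and a real HTML parser would handle far more cases.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient.
final class MetaCharsetSniffer {

    // Matches charset=NAME anywhere inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset name, or null if none is found. */
    static String sniff(byte[] body, int maxBytes) {
        int n = Math.min(body.length, maxBytes);
        // ISO-8859-1 maps every byte to one char, so decoding cannot
        // fail and the ASCII markup stays readable.
        String head = new String(body, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

The caller would then go back to the original bytes and decode them again with the sniffed charset, which is exactly the "return to the original byte stream" step discussed earlier in the thread.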

RE: [HttpClient]Encoding

Posted by Rapheal Kaplan <ra...@mimir.net>.
  I think I understand why Marc wanted to leave that level of support
outside the HttpClient API.  As long as the client deals strictly with
binary streams there is no reason why a higher level part of the application
can't handle the encoding issues.

  In essence, Marc has said that the fact that the API doesn't handle
encoding is intentional, and the ...AsString method is not really meant to
be used for proper display encoding.  I think that handling the encoding
properly, the way that Marc described, would require changes to the
actual API, and is really best handled somewhere else.
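
  As a rough sketch of what such a higher-level layer could look like, the
class below takes the raw bytes (for example from getResponseBody()) plus the
Content-Type header value and decodes them outside of HttpClient.  The names
BodyDecoder, charsetOf and decode are hypothetical, and falling back to
ISO-8859-1 when no usable charset parameter is present is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical application-level helper, not part of HttpClient.
final class BodyDecoder {

    /** Picks the charset named in a Content-Type value, else ISO-8859-1. */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length())
                            .trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed name: fall back
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    /** Decodes the raw response bytes using the header's charset. */
    static String decode(byte[] body, String contentType) {
        return new String(body, charsetOf(contentType));
    }
}
```

  Keeping this outside the client means HttpClient can keep treating the
body as plain bytes, while each application decides how much effort to spend
on guessing the encoding.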

  - Rapheal Kaplan

-----Original Message-----
From: Sung-Gu [mailto:jericho@apache.org]
Sent: Thursday, March 21, 2002 1:18 PM
To: commons-dev@jakarta.apache.org
Cc: Slide Developers Mailing List
Subject: Re: [HttpClient]Encoding




