Posted to dev@hc.apache.org by Adrian Sutton <ad...@ephox.com> on 2003/03/07 10:10:45 UTC

Character Encodings

Hi all,
I'm not too certain about all the details of character encodings in
HttpClient, but it is on my list of docs to write, so I would like to
confirm a few things and gather any thoughts about it.

1. URLs should consist only of US-ASCII characters whenever possible, as
this is the character set RFC 1738 allows unescaped; anything else has to
be sent as %XX escapes. Using other characters may cause compatibility
issues with some servers (e.g. Windows Web Folders), mostly because there
is no way to determine which encoding was used for the URL.

2. The headers of an HTTP request/response must always be ISO-8859-1 (or is
this ASCII?) as per the HTTP standard.

3. The Content-Type header may specify a charset for the body of the HTTP
request/response, e.g. Content-Type: text/html; charset=UTF-8

4. Is there any simple way to extract the charset returned by the server
from HttpClient?  If not, we should probably add one.  Obviously you could
get the Content-Type header and parse it yourself, but since HttpClient
already does this (I think) it would be better not to duplicate that
parsing in user code.

5. getResponseBodyAsString always uses the platform default encoding.  Why
doesn't it use the charset specified in the Content-Type header of the
HTTP response?

6. Some document types specify the charset inside the document itself.  You
should consult the appropriate standards to determine whether to use the
charset specified in the HTTP response or the charset in the document.
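
To make points 3 to 5 concrete, here is a rough sketch of the kind of
helper I have in mind.  It is plain Java and not HttpClient API: the class
and method names (CharsetSketch, charsetOf, decode) are made up for
illustration, and falling back to ISO-8859-1 when no usable charset is
given is my assumption, based on the HTTP/1.1 default for text types.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpClient.
final class CharsetSketch {
    // Assumed fallback when the header carries no usable charset.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    /** Returns the charset parameter of a Content-Type value, or the default. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return DEFAULT_CHARSET;
        }
        // parts[0] is the media type; the rest are name=value parameters.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim();
            if (name.equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // The value may be a quoted string, e.g. charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (value.length() > 0) {
                    return value;
                }
            }
        }
        return DEFAULT_CHARSET;
    }

    /** Decodes the raw body bytes using the charset named in the header. */
    static String decode(byte[] body, String contentType) {
        Charset charset;
        try {
            charset = Charset.forName(charsetOf(contentType));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: fall back to the default.
            charset = StandardCharsets.ISO_8859_1;
        }
        return new String(body, charset);
    }
}
```

The idea is that you would hand decode() the raw response bytes and the
value of the Content-Type response header, rather than relying on the
platform default encoding.  It deliberately ignores point 6, i.e. any
charset declared inside the document itself.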

Any other things that should be documented would be good to know as well.
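
Going back to point 1, a small sketch of why the URL encoding is ambiguous:
the same path escapes to different %XX sequences depending on which charset
the client picks, and nothing in the URL tells the server which one it was.
Again this is plain Java with made-up names (UrlEscapeSketch, escapePath),
not HttpClient API, and the set of characters left unescaped is my own
simplification.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of HttpClient.
final class UrlEscapeSketch {
    // Characters passed through as-is (a simplified safe set plus '/').
    private static final String SAFE =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_.~/";

    /** Escapes every byte of the path that is not in the safe set as %XX. */
    static String escapePath(String path, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : path.getBytes(charset)) {
            int v = b & 0xFF;
            if (v < 0x80 && SAFE.indexOf((char) v) >= 0) {
                out.append((char) v);
            } else {
                out.append('%');
                out.append(Character.toUpperCase(Character.forDigit(v >> 4, 16)));
                out.append(Character.toUpperCase(Character.forDigit(v & 0xF, 16)));
            }
        }
        return out.toString();
    }
}
```

For a path containing an e-acute, the UTF-8 form escapes it as %C3%A9 while
the ISO-8859-1 form escapes it as %E9, so a server that guesses the wrong
encoding will look up a different name.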

Thanks in advance.

Adrian "Doc Boy" Sutton.