Posted to users@httpd.apache.org by André Warnier <aw...@ice-sa.com> on 2009/02/07 22:30:56 UTC

[users@httpd] what is the charset of a URL ?

Hi.

I have been wondering for a while about how a server application should 
really consider the "query string" part of a URL, in terms of character 
encoding.  I am talking here of a URL of the form
http://hostname/somepath?name1=value1&name2=value2..&nameN=valueN
(the part after the question mark)

Starting with a quote from
http://www.w3.org/TR/html401/interact/forms.html#h-17.3 :

accept-charset = charset list [CI]
     This attribute specifies the list of character encodings for input 
data that is accepted by the server processing this form. The value is a 
space- and/or comma-delimited list of charset values. The client must 
interpret this list as an exclusive-or list, i.e., the server is able to 
accept any single character encoding per entity received.
     The default value for this attribute is the reserved string 
"UNKNOWN". User agents may interpret this value as the character 
encoding that was used to transmit the document containing this FORM 
element.

Some people (myself included), after trying to digest the various RFCs 
and other recommendations that seem to deal with the subject (e.g. 
RFC3986 and the document above), come to the conclusion that the 
character set and/or encoding of the query string, after 
percent-decoding, is basically undefined from a server's point of view.
Others seem to be convinced that it is Unicode encoded as UTF-8.
Yet others that it is, by default, iso-8859-1.
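
To make the ambiguity concrete, here is a minimal Python sketch (not 
part of the original post). The field name "name" and the value "café" 
are made up for illustration. It shows that percent-decoding only 
yields bytes, and that the charset applied afterwards is the server's 
own guess:

```python
from urllib.parse import unquote_to_bytes

# The same user input, "café", as two different clients might send it.
# Both query strings are hypothetical examples.
raw_utf8 = "name=caf%C3%A9"    # "é" sent as the UTF-8 bytes C3 A9
raw_latin1 = "name=caf%E9"     # "é" sent as the ISO-8859-1 byte E9

def decode_value(query, charset):
    """Percent-decode the value of a single name=value pair to bytes,
    then decode those bytes with the charset the server assumes."""
    value = query.split("=", 1)[1]
    return unquote_to_bytes(value).decode(charset)

print(decode_value(raw_utf8, "utf-8"))         # café
print(decode_value(raw_latin1, "iso-8859-1"))  # café
print(decode_value(raw_utf8, "iso-8859-1"))    # cafÃ© (wrong guess)
```

Decoding raw_latin1 as UTF-8 raises UnicodeDecodeError, which is the 
one case where a wrong guess is at least detectable.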

So which is it?
If I take the above quotation for instance, the part "User agents *may* 
interpret" (emphasis mine) kind of bothers me, in the sense that it 
implies that the browser can do what it wants anyway.
The other part that bothers me is that, according to the above, the 
"accept-charset" attribute can specify *a list* of character encodings, 
and not just one.
Then the above goes on to say "the server is able to accept any single 
character encoding per entity received". What, in this case, is an 
"entity"? Are we talking about the whole form submission, i.e. the 
whole query string, or about individual data items, i.e. the 
individual "name=value" pairs?

So basically, what will the browser pick, and how would the server know 
what it picked ?

One could argue that the server should only send forms as follows:
- the server response to the browser should contain a "Content-Type:" 
header that specifies not only the MIME type "text/html" (or 
equivalent), but also a "charset" parameter.
- the HTML document being sent should contain a <meta> tag that 
explicitly provides the document charset/encoding, like
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />.
- the <form> in the document should specify an "accept-charset" 
attribute, preferably with a single charset/encoding like "utf-8".
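
The three measures above could be sketched in Python roughly as 
follows. This is only an illustration under assumptions: the action 
"/search" and the field name "q" are hypothetical, and the function 
merely builds the headers and body rather than serving them.

```python
CHARSET = "utf-8"  # assumption: the application standardises on UTF-8

def build_form_response(action="/search"):
    """Return (headers, body) for a form page that declares its charset
    in the HTTP header, in a <meta> tag and in accept-charset."""
    body = (
        "<html><head>"
        f'<meta http-equiv="Content-Type" content="text/html;charset={CHARSET}" />'
        "</head><body>"
        f'<form method="get" action="{action}" accept-charset="{CHARSET}">'
        '<input type="text" name="q" />'
        '<input type="submit" />'
        "</form></body></html>"
    ).encode(CHARSET)
    headers = [
        ("Content-Type", f"text/html; charset={CHARSET}"),
        ("Content-Length", str(len(body))),
    ]
    return headers, body
```

None of this binds the browser, of course; it only removes the 
server's own contribution to the ambiguity.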

That's all well and good, but

a) if the incoming URL is something typed by a user into the URL bar of 
the browser, there is no such previous response sent by the server.
b) HTTP being a stateless protocol, the server should in any case not 
have any recollection that it has previously sent such a form to the 
same browser (yesterday?), so when a request comes in, the server 
doesn't know any of the things above for sure.
c) the browser may decide to do whatever it pleases and disregard what 
the server told it (IE comes to mind, practical examples on request).
It would then be in violation of the specifications, but considering 
the above I'm not so sure it is clear-cut.

For a while now, I have resorted to doing all of the things above and, 
in addition, to always sending forms that specify 
enctype="multipart/form-data", for which the problem should not exist.
I also make sure that each form contains a hidden field, itself 
containing a string whose content is known to the application, which 
upon form submission can be checked for any discrepancy (at least 
between UTF-8 and an ISO-8859 encoding; unfortunately, it cannot 
distinguish between different ISO-8859 encodings).
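
A minimal Python sketch of such a hidden-field check might look like 
this. The field name "_cs" and the sentinel "é" are hypothetical 
choices for illustration, not taken from any real application:

```python
from urllib.parse import unquote_to_bytes

SENTINEL = "\u00e9"  # "é"; hypothetical sentinel placed in the hidden field

# Byte patterns the sentinel produces under each encoding we can tell apart.
KNOWN = {
    SENTINEL.encode("utf-8"): "utf-8",            # b"\xc3\xa9"
    SENTINEL.encode("iso-8859-1"): "iso-8859-1",  # b"\xe9"
}

def guess_charset(query_string, field="_cs"):
    """Return the charset suggested by the hidden sentinel field,
    or None if the field is absent or its bytes are unrecognised."""
    for pair in query_string.split("&"):
        name, _, value = pair.partition("=")
        if name == field:
            return KNOWN.get(unquote_to_bytes(value))
    return None

print(guess_charset("q=x&_cs=%C3%A9"))  # utf-8
print(guess_charset("_cs=%E9&q=x"))     # iso-8859-1
```

A lone byte E9 cannot reveal which single-byte encoding produced it, 
so the "iso-8859-1" answer really means "some single-byte encoding".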

But that seems like hideous overkill, and is still not totally foolproof.
(multipart/form-data also has the drawback that it does not play 
very well with some authentication schemes using redirects.)

It seems to me that the specifications are still not clear and/or not 
tight enough.

Am I missing something ?

(And yes I know about PUNYCODE, but in my understanding that applies to 
DNS hostnames, not to query strings.)





---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@httpd.apache.org
   "   from the digest: users-digest-unsubscribe@httpd.apache.org
For additional commands, e-mail: users-help@httpd.apache.org


Re: [users@httpd] what is the charset of a URL ?

Posted by Nick Kew <ni...@webthing.com>.
On 7 Feb 2009, at 21:30, André Warnier wrote:

> Hi.
>
> I have been wondering for a while about how a server application  
> should really consider the "query string" part of a URL, in terms  
> of character encoding.  I am talking here of a URL of the form
> http://hostname/somepath?name1=value1&name2=value2..&nameN=valueN
> (the part after the question mark)

This question crops up on apache lists from time to time - check  
archives.
Basically:
   - it's underspecified in the specs - hence the need for your  
question.
   - in practice, in the case of HTML forms and form submissions,
     browsers will use the charset of the form.  But that's empirical,
     and could break down if a browser doesn't support a charset.
   - there are various standards (e.g. HTML which you cite, and XML)
     that say something on the subject.  But if you generalise any
     one of them, it'll conflict with another.

> That's all nice and well, but
>
> a) if this incoming URL is something typed by a user in the URL bar  
> of the browser, there is no such previous response sent by the server.

A user typing thusly is interacting on their own terms with your  
application.
It's up to them to be compatible - whatever that is.

> b) HTTP being a connection-less protocol, the server should anyway  
> not have any recollection that it has previously sent such a form  
> to the same browser (yesterday ?), so when a request comes in, the  
> server doesn't know any of these things above for sure

But it need only be designed to work with its own pages.

> c) the browser may decide to do whatever it pleases and disregard  
> what the server told it (IE comes to mind, practical examples on  
> request).

I haven't heard of IE screwing up charsets in HTML forms.  But ICBW.

> It should then be in violation of the specifications, but  
> considering the above I'm not so sure it is clear-cut.
>
> For a while now, I have resorted to do all the things above, and in  
> addition to always sending forms specifying "enctype=multipart/form- 
> data", for which the problem should not exist.

Um, it just moves to the charset of the form parts!

-- 
Nick Kew


Re: [users@httpd] what is the charset of a URL ?

Posted by Sean Conner <sp...@conman.org>.
It was thus said that the Great André Warnier once stated:
> Hi.
> 
> Some people (to which I belong), after trying to digest the various RFCs 
> and other recommendations that seem to deal with the subject (e.g. 
> RFC3986 and the document above), come to the conclusion that the 
> character set and/or encoding of the query string, after 
> percent-decoding, is basically undefined from a server's point of view.
> Others seem to be convinced that it is Unicode encoded as UTF-8.
> Yet others that it is, by default, iso-8859-1.
> 
> Now what is it ?

  Whatever the browser wants, although Firefox may use the character
encoding the page was sent as (and the HTML spec---or was it the HTTP
spec?---says the default is UTF-8).

> If I take the above quotation for instance, the part "User agents *may* 
> interpret " (the emphasis is mine only) kind of bothers me, in the sense 
> that it implies that the browser can do what it wants anyway.
> The other part that bothers me is that according to the above, the 
> "accept-charset" attribute can specify *a list* of character encodings, 
> and not just one.
> Then the above goes on to say "the server is able to accept any single 
> character encoding per entity received". What in this case is an 
> "entity" ? are we talking about the whole form submission, like in 
> "query string", or are we talking individual data items, as in the 
> individual "name=value" pairs ?

  From playing around with it, it seems to apply to the entire submission,
in that all the name/value pairs are encoded in a single character set.

> But that seems like some hideous overkill, and still not totally foolproof.
> (multipart/form-data also has the inconvenient that it does not play 
> very well with some authentication schemes using redirects)
> 
> It seems to me that the specifications are still not clear and/or not 
> tight enough.
> 
> Am I missing something ?

  I don't think so.  I think I ended up writing a CGI script that assumes
UTF-8 and, if it encounters a problem, switches to ISO-8859-1 and then
Windows-1251 (or some combination---it's around here somewhere).  I used the
GNU iconv library (at our company we use Linux, so it's easy to install and
use) to do the conversions.

  Messy, but it's about the best you can do.
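
  For illustration, the fallback idea can be sketched in Python as below.
This is not the CGI script mentioned above and it uses Python's built-in
codecs rather than GNU iconv; the function name and the default candidate
list are assumptions.

```python
def decode_with_fallback(raw, candidates=("utf-8", "iso-8859-1")):
    """Try each candidate charset in order and return (text, charset)
    for the first one that decodes the bytes without error."""
    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate charset could decode the input")

print(decode_with_fallback(b"caf\xc3\xa9"))  # ('café', 'utf-8')
print(decode_with_fallback(b"caf\xe9"))      # ('café', 'iso-8859-1')
```

  Note that Python's iso-8859-1 codec maps every byte value, so it never
fails; any candidate listed after it would never be tried.  The order of
the list therefore decides the outcome more than the data does.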

  -spc (Even with the 'accept-charset' attribute, there may be some
	user-agent out there that doesn't support it, so you're
	still screwed ... )

