Posted to users@tomcat.apache.org by André Warnier <aw...@ice-sa.com> on 2008/10/18 19:02:17 UTC
I18N, HTTP 2.0 ?
Hi.
I am sending this to both the Apache httpd and Tomcat users lists, in
the hope that because together these HTTP servers cover a good fraction
of the market, there might be a chance to reach the right people.
My hope is that someone who is aware of, and connected to, the process
of RFC generation would pick this up, or else inform us if some process
in the direction that I am indicating below is already under way.
I apologise in advance if I am pushing at an open door. If so, I would
be glad to be told what the state of affairs is.
(A Google search on the terms "HTTP" and "RFC" and "UTF-8" does not seem
to yield any relevant results.)
Proposal :
It is becoming urgent to create a new HTTP standard/version/revision,
that would be organised around Unicode as a default character set, and
UTF-8 as a default encoding.
I believe that the spread and acceptance of Unicode and UTF-8 is now
sufficient to warrant such an evolution.
The current situation, where iso-8859-1 is the default in some areas,
and some other areas are either unspecified or vague, creates a lot of
confusion and inefficiencies, and creates barriers to the creation of
truly international HTTP-based WWW applications.
Here are some areas where these problems appear :
- the encoding of URLs.
- the encoding of HTTP headers.
- the encoding of user credentials in browser-side Basic and Digest
authentication dialogs, and their transmission to the server.
- the encoding of input elements from html forms, as transmitted by a
client to a server, and the interpretation of ditto data by the server
I am quite sure that I am forgetting some aspects of the same issue.
For each of the above, there are areas where there is no specification,
or areas where there are vague specifications, or areas where there are
multiple apparently-contradictory specifications.
Consequently, there is a profusion of ad-hoc tricks and recipes, and
various "parameters" and "flags" and "settings" are starting to appear at
the client and server level, which may help resolve the issues in some
cases, but which in the long term create even more confusion and
problems of interoperability.
(example of a setting : "use body encoding for URL").
There might be some efforts under way to tackle one or the other aspect
of the above (I have heard of a proposal regarding HTTP headers), but I
honestly believe that this issue can only be resolved well "at the top",
which seems to me the HTTP protocol itself.
Thanks
---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org
Re: [users@httpd] I18N, HTTP 2.0?
Posted by Daniel Aleksandersen <al...@runbox.com>.
----- Start Original Message -----
Sent: Sat, 18 Oct 2008 19:02:17 +0200
From: André Warnier <aw...@ice-sa.com>
To: users@httpd.apache.org, Tomcat Users List <us...@tomcat.apache.org>
Subject: [users@httpd] I18N, HTTP 2.0 ?
> I am sending this to both the Apache httpd and Tomcat users lists, in
> the hope that because together these HTTP servers cover a good fraction
> of the market, there might be a chance to reach the right people.
>
> My hope is that someone who is aware of, and connected to, the process
> of RFC generation would pick this up, or else inform us if some process
> in the direction that I am indicating below is already under way.
>
> I apologise in advance if I am pushing at an open door. If so, I would
> be glad to be told what the state of affairs is.
> (A Google search on the terms "HTTP" and "RFC" and "UTF-8" does not seem
> to yield any relevant results.)
>
> Proposal :
>
> It is becoming urgent to create a new HTTP standard/version/revision,
> that would be organised around Unicode as a default character set, and
> UTF-8 as a default encoding.
>
> I believe that the spread and acceptance of Unicode and UTF-8 is now
> sufficient to warrant such an evolution.
>
> The current situation, where iso-8859-1 is the default in some areas,
> and some other areas are either unspecified or vague, creates a lot of
> confusion and inefficiencies, and creates barriers to the creation of
> truly international HTTP-based WWW applications.
>
> Here are some areas where these problems appear :
> - the encoding of URLs.
> - the encoding of HTTP headers.
> - the encoding of user credentials in browser-side Basic and Digest
> authentication dialogs, and their transmission to the server.
> - the encoding of input elements from html forms, as transmitted by a
> client to a server, and the interpretation of ditto data by the server
>
> I am quite sure that I am forgetting some aspects of the same issue.
>
> For each of the above, there are areas where there is no specification,
> or areas where there are vague specifications, or areas where there are
> multiple apparently-contradictory specifications.
> Consequently, there is a profusion of ad-hoc tricks and recipes, and
> various "parameters" and "flags" and "settings" are starting to appear at
> the client and server level, which may help resolve the issues in some
> cases, but which in the long term create even more confusion and
> problems of interoperability.
> (example of a setting : "use body encoding for URL").
>
> There might be some efforts under way to tackle one or the other aspect
> of the above (I have heard of a proposal regarding HTTP headers), but I
> honestly believe that this issue can only be resolved well "at the top",
> which seems to me the HTTP protocol itself.
----- End Original Message -----
I just want to say that I agree with you in recommending UTF-8 as the default character encoding. There has been a natural evolution toward richer character sets, but the HTTP (and other) standards have not followed it.
I doubt, however, that HTTP, one of the web's core protocols, will be revised just to make room for internationalisation. More needs would have to be addressed at the same time to make something happen in this area.
Personally I would want to see the HTTP client-error status 402 (Payment Required) specified in the upcoming specs. There are so many for-pay web sites and services around that this should have been specified a long time ago.
--
Daniel Aleksandersen
---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@httpd.apache.org
To unsubscribe from the digest, e-mail: users-digest-unsubscribe@httpd.apache.org
For additional commands, e-mail: users-help@httpd.apache.org
Re: [users@httpd] I18N, HTTP 2.0 ?
Posted by Nick Kew <ni...@webthing.com>.
William A. Rowe, Jr. wrote:
> Nick Kew wrote:
>>> So what does the HTML spec have to say? The <FORM > submission element
>>> does include the accept-charset attribute, perhaps that is what you are
>>> looking for? Otherwise, if the user agents don't observe RFC 2388 then
>>> you should really take that up with the user agent vendors.
>> This became a (relatively) frequent complaint with mod_proxy_html 2.x, and
>> one of the motivations behind the updates in 3.0.
>>
>> The issue: libxml2 uses utf-8 internally. When presented with a different charset,
>> mod_proxy_html has to convert (or set up the parser to convert internally), and
>> mod_proxy_html 2.x always generates output as utf-8.
>
> Right - using an xml parser for sgml has several interesting side effects :)
HTMLparser parses HTML and XHTML. And, more to the point in real
life, it parses tag-soup.
> So just out of curiosity, the module always emits the charset=utf-8 property
> for the request body content-type? Tomcat, for example, should parse such
> request bodies with no issue. Only non-utf-8 aware, custom applications
> that don't use a charset-aware parser should fail.
Nope, the module doesn't touch requests. It only processes responses.
--
Nick Kew
Re: [users@httpd] I18N, HTTP 2.0 ?
Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.
Nick Kew wrote:
>
>> So what does the HTML spec have to say? The <FORM > submission element
>> does include the accept-charset attribute, perhaps that is what you are
>> looking for? Otherwise, if the user agents don't observe RFC 2388 then
>> you should really take that up with the user agent vendors.
>
> This became a (relatively) frequent complaint with mod_proxy_html 2.x, and
> one of the motivations behind the updates in 3.0.
>
> The issue: libxml2 uses utf-8 internally. When presented with a different charset,
> mod_proxy_html has to convert (or set up the parser to convert internally), and
> mod_proxy_html 2.x always generates output as utf-8.
Right - using an xml parser for sgml has several interesting side effects :)
> The complaint was that when this happens to a page containing a <form>,
> it would cause browsers to submit the form data as utf-8, which in turn
> screwed up some people's applications. It's not a problem I've had myself,
> but a few users made the case coherently, so I felt compelled to fix it by
> enabling the user to specify an output charset of choice.
So just out of curiosity, the module always emits the charset=utf-8 property
for the request body content-type? Tomcat, for example, should parse such
request bodies with no issue. Only non-utf-8 aware, custom applications
that don't use a charset-aware parser should fail.
Re: [users@httpd] I18N, HTTP 2.0 ?
Posted by Nick Kew <ni...@webthing.com>.
On 18 Oct 2008, at 20:22, William A. Rowe, Jr. wrote:
> Comments inline; You painted this situation today with an overly broad brush,
> there are some remaining issues but they are much narrower than you identify
> below...
You seem to have summed up the issues. Just to expand with a
practical point.
> So what does the HTML spec have to say? The <FORM > submission element
> does include the accept-charset attribute, perhaps that is what you are
> looking for? Otherwise, if the user agents don't observe RFC 2388 then
> you should really take that up with the user agent vendors.
This became a (relatively) frequent complaint with mod_proxy_html 2.x, and
one of the motivations behind the updates in 3.0.
The issue: libxml2 uses utf-8 internally. When presented with a different charset,
mod_proxy_html has to convert (or set up the parser to convert internally), and
mod_proxy_html 2.x always generates output as utf-8.
The complaint was that when this happens to a page containing a <form>,
it would cause browsers to submit the form data as utf-8, which in turn
screwed up some people's applications. It's not a problem I've had myself,
but a few users made the case coherently, so I felt compelled to fix it by
enabling the user to specify an output charset of choice.
So as you say, there is an issue, but I think this is indeed the extent of it.
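A minimal Python sketch of the failure mode described above, assuming an application that decodes application/x-www-form-urlencoded data as ISO-8859-1 while the browser, following the charset of the rewritten page, submits UTF-8. The field name and value are invented for illustration, and the standard urllib.parse functions stand in for whatever parser a real application uses.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form field; "\u00e9" is e-acute.
form = {"name": "Andr\u00e9"}

# What a browser would typically send for a page served as UTF-8.
submitted = urlencode(form, encoding="utf-8")

# An application that still assumes ISO-8859-1 sees two wrong characters
# in place of the single accented letter.
as_app_sees_it = parse_qs(submitted, encoding="iso-8859-1")["name"][0]

# The same bytes decoded with the matching charset round-trip cleanly.
as_intended = parse_qs(submitted, encoding="utf-8")["name"][0]

print(submitted)
print(repr(as_app_sees_it))
print(repr(as_intended))
```

Letting the administrator pick the output charset keeps the page charset, and therefore the submitted form charset, aligned with what the backend application expects.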
--
Nick Kew
Re: [users@httpd] I18N, HTTP 2.0 ?
Posted by André Warnier <aw...@ice-sa.com>.
William, Nick,
William A. Rowe, Jr. wrote:
[...]
>
> The reason your search was futile is that you want to focus on searching
> internet-draft where there are proposals in this sphere. Also watch the
> dependencies of the http draft, many of those have also evolved and are
> beginning to solve the utf8 situation.
>
You may be right, and maybe my broad brush was due to a lack of
information. I am an experienced searcher, but I may have been looking
in the wrong places. If there exist solutions or proposals in this
area that I failed to find, could you provide some relevant links ?
Thanks.
Re: [users@httpd] I18N, HTTP 2.0 ?
Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.
Comments inline; You painted this situation today with an overly broad brush,
there are some remaining issues but they are much narrower than you identify
below...
André Warnier wrote:
>
> It is becoming urgent to create a new HTTP standard/version/revision,
> that would be organised around Unicode as a default character set, and
> UTF-8 as a default encoding.
The reason your search was futile is that you want to focus on searching
internet-draft where there are proposals in this sphere. Also watch the
dependencies of the http draft, many of those have also evolved and are
beginning to solve the utf8 situation.
> Here are some areas where these problems appear :
> - the encoding of URLs.
That is not a problem. URLs are essentially ASCII, and the meaning of bytes
with the high-order bit set is undefined. So from a presentation perspective it
can be a problem, but technically and operationally it is not. The only way
to represent URLs in the spirit of their design is to %-encode the high-bit
characters for presentation. The underlying bytes can be UTF-8 or ISO-8859-1
(not either-or, but the administrator's choice), and the result is easily typed
in from hardcopy (e.g. the tag on a TV commercial) by anyone using any character
set who has access to the ASCII subset.
Using "UTF-8" alone is not enough; to accept arbitrary characters is
to ignore the fact that there are multiple representations, often not
entirely synonymous, from visual references which are entered by the
user. It's to ignore the issue of canonical forms when we are lucky
enough to have an astute reader. So % encoding is the only safe data
entry format from the sensory world to the browser url bar.
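To make the point concrete, here is a minimal Python sketch (the path is an invented example, not taken from this thread). It shows that the same visible URL yields different percent-encoded forms depending on the charset and on the Unicode normalization form, while each percent-encoded result is plain ASCII that anyone can type.

```python
import unicodedata
from urllib.parse import quote, unquote

# Hypothetical path; "\u00e9" is the precomposed letter e-acute.
path = "/caf\u00e9"

# Same characters, two charsets: the administrator's choice.
as_utf8 = quote(path, encoding="utf-8")
as_latin1 = quote(path, encoding="iso-8859-1")

# Same visible text in decomposed form (e + combining acute accent).
decomposed = unicodedata.normalize("NFD", path)
as_utf8_nfd = quote(decomposed, encoding="utf-8")

for label, value in [("UTF-8", as_utf8),
                     ("ISO-8859-1", as_latin1),
                     ("UTF-8, NFD", as_utf8_nfd)]:
    print(f"{label:12} {value}")

# Decoding only works when the reader knows which charset was used.
print(unquote(as_latin1, encoding="iso-8859-1") == path)
```

Three different byte sequences stand behind one rendered string, which is why the percent-encoded form, rather than the rendered characters, is the unambiguous one.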
> - the encoding of HTTP headers.
Headers? I hope you mean header values. *TEXT values clearly declare
how to shift to utf-8, but there's an ongoing discussion of how to fix
or broaden or clarify this on the http-wg list.
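For reference, RFC 2616 section 2.2 allows characters outside ISO-8859-1 in *TEXT only when they are encoded according to RFC 2047, which is presumably the "shift" meant here. A minimal Python sketch of such an encoded-word, using the standard email.header module; the sample value is invented, and nothing here implies that HTTP clients or servers of the time actually handled such values.

```python
from email.header import Header, decode_header

# Sample header value, chosen for illustration only.
value = "Andr\u00e9 Warnier"

# Encode as an RFC 2047 encoded-word: pure ASCII on the wire,
# with the charset named inside the value itself.
encoded = Header(value, charset="utf-8").encode()

# Decode again; decode_header returns (bytes, charset) pairs.
raw, charset = decode_header(encoded)[0]
decoded = raw.decode(charset)

print(encoded)
print(decoded)
```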
> - the encoding of user credentials in browser-side Basic and Digest
> authentication dialogs, and their transmission to the server.
This is a side effect of the HTTP headers question, and further it's a
UI design issue.
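A small Python sketch of why the header question carries over to Basic authentication: the credentials travel as base64 of the "user:password" bytes, so the charset the client uses to produce those bytes determines what the server receives. The helper names and the sample credentials are invented for illustration.

```python
import base64

def basic_authorization(user: str, password: str, charset: str) -> str:
    """Build an Authorization header value; charset is the client's choice."""
    raw = f"{user}:{password}".encode(charset)
    return "Basic " + base64.b64encode(raw).decode("ascii")

def server_decode(header_value: str, charset: str) -> str:
    """Decode the credentials the way a server assuming `charset` would."""
    raw = base64.b64decode(header_value.split(" ", 1)[1])
    return raw.decode(charset, errors="replace")

# Hypothetical credentials with one non-ASCII character.
utf8_header = basic_authorization("Andr\u00e9", "secret", "utf-8")
latin1_header = basic_authorization("Andr\u00e9", "secret", "iso-8859-1")

print(utf8_header)
print(latin1_header)

# A server assuming UTF-8 cannot recover the name sent as ISO-8859-1.
mismatch = server_decode(latin1_header, "utf-8")
print(repr(mismatch))
```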
> - the encoding of input elements from html forms, as transmitted by a
> client to a server, and the interpretation of ditto data by the server
The RFC2616 http spec is clear on this and needs no further clarification.
    7.2 Entity Body
       The entity-body (if any) sent with an HTTP request or response is in
       a format and encoding defined by the entity-header fields.
           entity-body    = *OCTET
       An entity-body is only present in a message when a message-body is
       present, as described in section 4.3. The entity-body is obtained
       from the message-body by decoding any Transfer-Encoding that might
       have been applied to ensure safe and proper transfer of the message.
    7.2.1 Type
       When an entity-body is included with a message, the data type of that
       body is determined via the header fields Content-Type and
       Content-Encoding. These define a two-layer, ordered encoding model:
           entity-body := Content-Encoding( Content-Type( data ) )
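The two-layer model quoted above can be sketched in a few lines of Python. This is only an illustration under assumed values: the payload is invented, UTF-8 stands in for whatever charset the Content-Type header declares, and gzip stands in for the Content-Encoding.

```python
import gzip

text = "Joe owes \u20ac100"   # sample payload, invented
charset = "utf-8"            # stands in for the Content-Type charset parameter

# Sender: characters -> bytes (Content-Type layer),
# then compression (Content-Encoding layer).
typed = text.encode(charset)
entity_body = gzip.compress(typed)

# Receiver: undo the layers in reverse order, guided by the same headers.
recovered = gzip.decompress(entity_body).decode(charset)

print(len(typed), len(entity_body))
print(recovered == text)
```

The body itself is just octets; everything needed to interpret it has to come from the entity headers.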
And RFC2388 multipart/form-data spec is completely clear on this...
    4.5 Charset of text in form data
       Each part of a multipart/form-data is supposed to have a
       content-type. In the case where a field element is text, the charset
       parameter for the text indicates the character encoding used.
       For example, a form with a text field in which a user typed 'Joe owes
       <eu>100' where <eu> is the Euro symbol might have form data returned
       as:
           --AaB03x
           content-disposition: form-data; name="field1"
           content-type: text/plain;charset=windows-1250
           content-transfer-encoding: quoted-printable

           Joe owes =80100.
           --AaB03x
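As a worked illustration of the RFC 2388 example above, the following Python sketch decodes that part's body by hand: first the quoted-printable transfer encoding, then the declared charset. Only the two values copied from the example are used; a real server would of course parse the whole multipart body rather than a single hard-coded line.

```python
import quopri

# Body and charset as given in the RFC 2388 example part.
part_body = b"Joe owes =80100."
part_charset = "windows-1250"

# Step 1: undo content-transfer-encoding: quoted-printable ("=80" -> byte 0x80).
raw = quopri.decodestring(part_body)

# Step 2: interpret the bytes using the part's declared charset.
text = raw.decode(part_charset)

print(raw)
print(text)
```

Byte 0x80 is the Euro sign in windows-1250, so the declared charset is what makes the field recoverable.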
So what does the HTML spec have to say? The <FORM > submission element
does include the accept-charset attribute, perhaps that is what you are
looking for? Otherwise, if the user agents don't observe RFC 2388 then
you should really take that up with the user agent vendors.