Posted to users@tomcat.apache.org by André Warnier <aw...@ice-sa.com> on 2008/10/18 19:02:17 UTC

I18N, HTTP 2.0 ?

Hi.

I am sending this to both the Apache httpd and Tomcat users lists, in 
the hope that because together these HTTP servers cover a good fraction 
of the market, there might be a chance to reach the right people.

My hope is that someone who is aware of, and connected to, the process 
of RFC generation would pick this up, or else inform us if some process 
in the direction that I am indicating below is already under way.

I apologise in advance if I am pushing at an open door.  If so, I would 
gladly be informed about what the state of affairs is.
(A Google search on the terms "HTTP" and "RFC" and "UTF-8" does not seem 
to yield any relevant results.)

Proposal :

It is becoming urgent to create a new HTTP standard/version/revision, 
that would be organised around Unicode as a default character set, and 
UTF-8 as a default encoding.

I believe that the spread and acceptance of Unicode and UTF-8 is now 
sufficient to warrant such an evolution.

The current situation, where iso-8859-1 is the default in some areas, 
and  some other areas are either unspecified or vague, creates a lot of 
confusion and inefficiencies, and creates barriers to the creation of 
truly international HTTP-based WWW applications.

Here are some areas where these problems appear :
- the encoding of URLs.
- the encoding of HTTP headers.
- the encoding of user credentials in browser-side Basic and Digest 
authentication dialogs, and their transmission to the server.
- the encoding of input elements from html forms, as transmitted by a 
client to a server, and the interpretation of ditto data by the server

I am quite sure that I am forgetting some aspects of the same issue.

For each of the above, there are areas where there is no specification, 
or areas where there are vague specifications, or areas where there are 
multiple apparently-contradictory specifications.
Consequently, there is a profusion of ad-hoc tricks and recipes, and 
various "parameters" and "flags" and "settings" are starting to appear at 
the client and server level, which may help resolve the issues in some 
cases, but which in the long term create even more confusion and 
problems of interoperability.
(example of a setting : "use body encoding for URL").

There might be some efforts under way to tackle one or the other aspect 
of the above (I have heard of a proposal regarding HTTP headers), but I 
honestly believe that this issue can only be resolved well "at the top", 
which seems to me the HTTP protocol itself.

Thanks

---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: [users@httpd] I18N, HTTP 2.0?

Posted by Daniel Aleksandersen <al...@runbox.com>.
----- Start Original Message -----
Sent: Sat, 18 Oct 2008 19:02:17 +0200
From: André Warnier <aw...@ice-sa.com>
To: users@httpd.apache.org, Tomcat Users List <us...@tomcat.apache.org>
Subject: [users@httpd] I18N, HTTP 2.0 ?

> I am sending this to both the Apache httpd and Tomcat users lists, in 
> the hope that because together these HTTP servers cover a good fraction 
> of the market, there might be a chance to reach the right people.
> 
> My hope is that someone who is aware of, and connected to, the process 
> of RFC generation would pick this up, or else inform us if some process 
> in the direction that I am indicating below is already under way.
> 
> I apologise in advance if I am pushing at an open door.  If so, I would 
> gladly be informed about what the state of affairs is.
> (A Google search on the terms "HTTP" and "RFC" and "UTF-8" does not seem 
> to yield any relevant results.)
> 
> Proposal :
> 
> It is becoming urgent to create a new HTTP standard/version/revision, 
> that would be organised around Unicode as a default character set, and 
> UTF-8 as a default encoding.
> 
> I believe that the spread and acceptance of Unicode and UTF-8 is now 
> sufficient to warrant such an evolution.
> 
> The current situation, where iso-8859-1 is the default in some areas, 
> and  some other areas are either unspecified or vague, creates a lot of 
> confusion and inefficiencies, and creates barriers to the creation of 
> truly international HTTP-based WWW applications.
> 
> Here are some areas where these problems appear :
> - the encoding of URLs.
> - the encoding of HTTP headers.
> - the encoding of user credentials in browser-side Basic and Digest 
> authentication dialogs, and their transmission to the server.
> - the encoding of input elements from html forms, as transmitted by a 
> client to a server, and the interpretation of ditto data by the server
> 
> I am quite sure that I am forgetting some aspects of the same issue.
> 
> For each of the above, there are areas where there is no specification, 
> or areas where there are vague specifications, or areas where there are 
> multiple apparently-contradictory specifications.
> Consequently, there is a profusion of ad-hoc tricks and recipes, and 
> various "parameters" and "flags" and "settings" are starting to appear at 
> the client and server level, which may help resolve the issues in some 
> cases, but which in the long term create even more confusion and 
> problems of interoperability.
> (example of a setting : "use body encoding for URL").
> 
> There might be some efforts under way to tackle one or the other aspect 
> of the above (I have heard of a proposal regarding HTTP headers), but I 
> honestly believe that this issue can only be resolved well "at the top", 
> which seems to me the HTTP protocol itself.
----- End Original Message -----

I just want to say that I agree with you in recommending UTF-8 as the default character encoding. There has been a natural evolution toward richer character sets, but the HTTP (and other) standards have not followed this evolution.

I doubt, however, that HTTP, one of the web's core protocols, will be revised just to make room for internationalisation. More needs would have to be addressed at the same time to make something happen in this area.

Personally I would want to see the HTTP client-error status 402 (Payment Required) specified in the upcoming specs. There are so many for-pay web sites and services around that this should have been specified a long time ago.
-- 
Daniel Aleksandersen

---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@httpd.apache.org
   "   from the digest: users-digest-unsubscribe@httpd.apache.org
For additional commands, e-mail: users-help@httpd.apache.org


Re: [users@httpd] I18N, HTTP 2.0 ?

Posted by Nick Kew <ni...@webthing.com>.
William A. Rowe, Jr. wrote:
> Nick Kew wrote:
>>> So what does the HTML spec have to say?  The <FORM > submission element
>>> does include the accept-charset attribute, perhaps that is what you are
>>> looking for?  Otherwise, if the user agents don't observe RFC 2388 then
>>> you should really take that up with the user agent vendors.
>> This became a (relatively) frequent complaint with mod_proxy_html 2.x, and
>> one of the motivations behind the updates in 3.0.
>>
>> The issue: libxml2 uses utf-8 internally.  When presented with a different charset,
>> mod_proxy_html has to convert (or setup the parser to convert internally), and
>> mod_proxy_html 2.x always generates output as utf-8.
> 
> Right - using an xml parser for sgml has several interesting side effects :)

HTMLparser parses HTML and XHTML.  And, more to the point in real
life, it parses tag-soup.

> So just out of curiosity, the module always emits the charset=utf-8 property
> for the request body content-type?  Tomcat, for example, should parse such
> request bodies with no issue.  Only non-utf-8 aware, custom applications
> that don't use a charset-aware parser should fail.

Nope, the module doesn't touch requests.  It only processes responses.

-- 
Nick Kew



Re: [users@httpd] I18N, HTTP 2.0 ?

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.
Nick Kew wrote:
> 
>> So what does the HTML spec have to say?  The <FORM > submission element
>> does include the accept-charset attribute, perhaps that is what you are
>> looking for?  Otherwise, if the user agents don't observe RFC 2388 then
>> you should really take that up with the user agent vendors.
> 
> This became a (relatively) frequent complaint with mod_proxy_html 2.x, and
> one of the motivations behind the updates in 3.0.
> 
> The issue: libxml2 uses utf-8 internally.  When presented with a different charset,
> mod_proxy_html has to convert (or setup the parser to convert internally), and
> mod_proxy_html 2.x always generates output as utf-8.

Right - using an xml parser for sgml has several interesting side effects :)

> The complaint was that when this happens to a page containing a <form>,
> it would cause browsers to submit the form data as utf-8, which in turn
> screwed up some people's applications.  It's not a problem I've had myself,
> but a few users made the case coherently, so I felt compelled to fix it by
> enabling the user to specify an output charset of choice.

So just out of curiosity, the module always emits the charset=utf-8 property
for the request body content-type?  Tomcat, for example, should parse such
request bodies with no issue.  Only non-utf-8 aware, custom applications
that don't use a charset-aware parser should fail.




Re: [users@httpd] I18N, HTTP 2.0 ?

Posted by Nick Kew <ni...@webthing.com>.
On 18 Oct 2008, at 20:22, William A. Rowe, Jr. wrote:

> Comments inline; You painted this situation today with an overly broad brush,
> there are some remaining issues but they are much narrower than you identify
> below...

You seem to have summed up the issues.  Just to expand with a practical point.

> So what does the HTML spec have to say?  The <FORM > submission element
> does include the accept-charset attribute, perhaps that is what you are
> looking for?  Otherwise, if the user agents don't observe RFC 2388 then
> you should really take that up with the user agent vendors.

This became a (relatively) frequent complaint with mod_proxy_html 2.x, and
one of the motivations behind the updates in 3.0.

The issue: libxml2 uses utf-8 internally.  When presented with a different charset,
mod_proxy_html has to convert (or setup the parser to convert internally), and
mod_proxy_html 2.x always generates output as utf-8.

The complaint was that when this happens to a page containing a <form>,
it would cause browsers to submit the form data as utf-8, which in turn
screwed up some people's applications.  It's not a problem I've had myself,
but a few users made the case coherently, so I felt compelled to fix it by
enabling the user to specify an output charset of choice.
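
The failure mode described above can be reproduced with a minimal Python
sketch.  The field name and value are hypothetical, and the assumption is
that the browser submits the form in the charset of the page it was served
(utf-8) while the application decodes it as iso-8859-1:

```python
from urllib.parse import urlencode, parse_qs

form = {"name": "André"}  # hypothetical form field

# A browser viewing a page served as utf-8 typically submits in utf-8.
body = urlencode(form, encoding="utf-8")
print(body)  # name=Andr%C3%A9

# An application that assumes iso-8859-1 decodes the same body wrongly.
print(parse_qs(body, encoding="iso-8859-1"))  # {'name': ['AndrÃ©']}

# Decoding with the charset actually used recovers the value.
print(parse_qs(body, encoding="utf-8"))       # {'name': ['André']}
```

Nothing in the urlencoded body itself says which charset was used, which is
why changing the page's output charset changes what the application receives.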

So as you say, there is an issue, but I think this is indeed the extent of it.

-- 
Nick Kew



Re: [users@httpd] I18N, HTTP 2.0 ?

Posted by André Warnier <aw...@ice-sa.com>.
William, Nick,

William A. Rowe, Jr. wrote:
[...]
> 
> The reason your search was futile is that you want to focus on searching
> internet-draft where there are proposals in this sphere.  Also watch the
> dependencies of the http draft, many of those have also evolved and are
> beginning to solve the utf8 situation.
> 
You may be right, and maybe my broad brush was due to a lack of
information. I am an experienced searcher, but I may have been looking
in the wrong places.  If there exist solutions or proposals in this
area that I failed to find, could you provide some relevant links ?

Thanks.




Re: [users@httpd] I18N, HTTP 2.0 ?

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.
Comments inline; You painted this situation today with an overly broad brush,
there are some remaining issues but they are much narrower than you identify
below...

André Warnier wrote:
> 
> It is becoming urgent to create a new HTTP standard/version/revision,
> that would be organised around Unicode as a default character set, and
> UTF-8 as a default encoding.

The reason your search was futile is that you want to focus on searching
internet-draft where there are proposals in this sphere.  Also watch the
dependencies of the http draft, many of those have also evolved and are
beginning to solve the utf8 situation.

> Here are some areas where these problems appear :
> - the encoding of URLs.

That is not a problem.  URLs are essentially ASCII, and the meaning of
bytes with the high-order bit set is undefined.  So from a presentation
perspective it can be a problem, but technically and operationally it is
not.  The only way to represent URLs in the spirit of their design is to
%-encode the high-bit characters for presentation.  They can be UTF-8 or
ISO-8859-1 (not either-or, but the administrator's choice) and are easily
typed in from hardcopy (e.g. the tag on a TV commercial) by anyone using
any character set who has access to the ASCII subset.
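
As a rough illustration of the %-encoding point, here is a minimal Python
sketch.  The path is a hypothetical example; the two charsets stand in for
the "administrator's choice" mentioned above:

```python
from urllib.parse import quote, unquote

path = "/café"  # hypothetical path containing one non-ASCII character

# The same visible character yields different octets per charset.
utf8_url = quote(path, encoding="utf-8")
latin1_url = quote(path, encoding="iso-8859-1")

print(utf8_url)    # /caf%C3%A9
print(latin1_url)  # /caf%E9

# Both forms are pure ASCII, so they survive any transport or keyboard.
assert utf8_url.isascii() and latin1_url.isascii()

# Decoding only round-trips when the receiver assumes the sender's charset.
print(unquote(utf8_url, encoding="utf-8"))       # /café
print(unquote(utf8_url, encoding="iso-8859-1"))  # /cafÃ©
```

The encoded URL carries no charset label, so the server side has to know
which of the two conventions its own links were generated with.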

Using "UTF-8" alone is not enough; to accept arbitrary characters is
to ignore the fact that there are multiple representations, often not
entirely synonymous, from visual references which are entered by the
user.  It's to ignore the issue of canonical forms when we are lucky
enough to have an astute reader.  So % encoding is the only safe data
entry format from the sensory world to the browser url bar.
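
The canonical-forms concern can be sketched in Python as follows.  This is
only an illustration, assuming Unicode normalisation forms NFC and NFD as
the "multiple representations" in question; the character is an arbitrary
example:

```python
import unicodedata
from urllib.parse import quote

composed = unicodedata.normalize("NFC", "é")    # single code point U+00E9
decomposed = unicodedata.normalize("NFD", "é")  # 'e' followed by U+0301

# Visually identical, yet different code point sequences...
print(len(composed), len(decomposed))  # 1 2

# ...and therefore different percent-encoded URLs.
print(quote(composed))    # %C3%A9
print(quote(decomposed))  # e%CC%81

# Normalising to one canonical form before encoding removes the ambiguity.
assert quote(unicodedata.normalize("NFC", decomposed)) == quote(composed)
```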

> - the encoding of HTTP headers.

Headers?  I hope you mean header values.  *TEXT values clearly declare
how to shift to utf-8, but there's an ongoing discussion of how to fix
or broaden or clarify this on the http-wg list.
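
For reference, RFC 2616 allows characters outside ISO-8859-1 in TEXT only
when they are encoded according to RFC 2047.  A minimal Python sketch of
that encoding, using the standard email.header module and a hypothetical
header value (real HTTP implementations may well not decode this):

```python
from email.header import Header, decode_header

value = "Grüße"  # hypothetical header value outside US-ASCII

# Wrap the value in an RFC 2047 encoded-word that names its charset.
encoded = Header(value, charset="utf-8").encode()
print(encoded)  # an ASCII token of the form =?utf-8?<b|q>?...?=

# The receiver reads the charset from the token instead of guessing it.
raw, charset = decode_header(encoded)[0]
decoded = raw.decode(charset)
assert decoded == value
```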

> - the encoding of user credentials in browser-side Basic and Digest
> authentication dialogs, and their transmission to the server.

This is a side effect of the HTTP headers question; beyond that, it is a
UI design issue.
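
To make the credentials case concrete, here is a minimal Python sketch of
how a Basic token depends on the charset the client picks.  The helper
name and the credentials are hypothetical, and no particular browser's
behaviour is implied:

```python
import base64

def basic_credentials(user: str, password: str, charset: str) -> str:
    """Build the token of an 'Authorization: Basic' header for a given charset."""
    raw = f"{user}:{password}".encode(charset)
    return base64.b64encode(raw).decode("ascii")

# Hypothetical credentials with one non-ASCII character in the user name.
latin1_token = basic_credentials("andré", "secret", "iso-8859-1")
utf8_token = basic_credentials("andré", "secret", "utf-8")

# The same typed credentials give two different header values.
assert latin1_token != utf8_token

# A server assuming ISO-8859-1 misreads a UTF-8 client.
misread = base64.b64decode(utf8_token).decode("iso-8859-1")
print(misread)  # andrÃ©:secret
```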

> - the encoding of input elements from html forms, as transmitted by a
> client to a server, and the interpretation of ditto data by the server

The RFC2616 http spec is clear on this and needs no further clarification.

7.2 Entity Body

   The entity-body (if any) sent with an HTTP request or response is in
   a format and encoding defined by the entity-header fields.

       entity-body    = *OCTET

   An entity-body is only present in a message when a message-body is
   present, as described in section 4.3. The entity-body is obtained
   from the message-body by decoding any Transfer-Encoding that might
   have been applied to ensure safe and proper transfer of the message.

7.2.1 Type


   When an entity-body is included with a message, the data type of that
   body is determined via the header fields Content-Type and Content-
   Encoding. These define a two-layer, ordered encoding model:

       entity-body := Content-Encoding( Content-Type( data ) )
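
The two-layer model quoted above can be sketched in Python like this.  It
is a minimal illustration, assuming a text/plain body in utf-8 with gzip
as the content-coding; the sample text and header values are chosen only
for the example:

```python
import gzip

text = "Joe owes €100"

# Inner layer: Content-Type (here text/plain; charset=utf-8) turns text into octets.
typed = text.encode("utf-8")
# Outer layer: Content-Encoding (here gzip) transforms those octets.
entity_body = gzip.compress(typed)

headers = {
    "Content-Type": "text/plain; charset=utf-8",
    "Content-Encoding": "gzip",
}

# The receiver undoes the layers in reverse order: content-coding first, then charset.
recovered = gzip.decompress(entity_body).decode("utf-8")
assert recovered == text
```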

And RFC2388 multipart/form-data spec is completely clear on this...

4.5 Charset of text in form data

   Each part of a multipart/form-data is supposed to have a content-
   type.  In the case where a field element is text, the charset
   parameter for the text indicates the character encoding used.

   For example, a form with a text field in which a user typed 'Joe owes
   <eu>100' where <eu> is the Euro symbol might have form data returned
   as:

    --AaB03x
    content-disposition: form-data; name="field1"
    content-type: text/plain;charset=windows-1250
    content-transfer-encoding: quoted-printable

    Joe owes =80100.
    --AaB03x
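
Decoding that part on the receiving side might look like the following
Python sketch.  It assumes the part headers have already been parsed, and
uses Python's "cp1250" codec for the declared windows-1250 charset:

```python
import quopri

# Body of the part from the RFC 2388 example above.
part_body = b"Joe owes =80100."

# Undo content-transfer-encoding: quoted-printable ("=80" becomes octet 0x80).
octets = quopri.decodestring(part_body)

# Interpret the octets using the part's declared charset, windows-1250.
text = octets.decode("cp1250")
print(text)  # Joe owes €100.

# With a wrong assumption the same octet 0x80 is not the Euro sign.
print(repr(octets.decode("iso-8859-1")))  # 'Joe owes \x80100.'
```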

So what does the HTML spec have to say?  The <FORM > submission element
does include the accept-charset attribute, perhaps that is what you are
looking for?  Otherwise, if the user agents don't observe RFC 2388 then
you should really take that up with the user agent vendors.

