Posted to users@tomcat.apache.org by Barry Lind <ba...@xythos.com> on 2000/11/06 23:14:50 UTC

Question on what character encoding to use for URI-decoding

We are having a problem with running our servlets on different servlet
containers when it comes to the URI-decoding of various parts of the
request URI.

Thankfully the 2.3 Servlet Spec addresses many of the problems by
explicitly specifying which methods decode and which do not (for
example, getRequestURI() does not decode, but getPathInfo() does).

However, the 2.3 spec remains unclear as to how exactly to convert a
URI-encoded string into a Java String, specifically which character set
is to be used.  In most places the spec simply says the returned value
is decoded, without specifying the character set.  In two places it
does attempt to be more specific by saying
"and characters ... are converted to ASCII characters."

Saying that the octets are converted to ASCII characters is vague.
ASCII (7-bit ASCII) only defines 128 characters (code points 0-127),
but the %xx notation allows you to pass any octet.  What character set
is used to convert the other octets (i.e. %80 - %ff) into characters?
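To make the gap concrete, here is a minimal Java sketch (written
against a current JDK, not any particular servlet container; the class
and method names are hypothetical).  It decodes a single octet under
US-ASCII and under ISO-8859-1 and shows that US-ASCII has no answer
for octets above 0x7F:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AsciiGapDemo {
    // Decode one octet under the given charset and return the resulting string.
    public static String decodeOctet(int octet, Charset cs) {
        return new String(new byte[] { (byte) octet }, cs);
    }

    public static void main(String[] args) {
        // 0x41 is 'A' under US-ASCII.
        System.out.println(decodeOctet(0x41, StandardCharsets.US_ASCII));
        // 0xE9 is outside US-ASCII; the JDK substitutes U+FFFD (replacement character).
        System.out.println(Integer.toHexString(
                decodeOctet(0xE9, StandardCharsets.US_ASCII).charAt(0)));
        // The same octet maps to U+00E9 under ISO-8859-1.
        System.out.println(Integer.toHexString(
                decodeOctet(0xE9, StandardCharsets.ISO_8859_1).charAt(0)));
    }
}
```

Once an octet has been replaced this way the original value is lost,
which is why "convert to ASCII" cannot be the whole rule.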

Looking to the HTTP/1.1 (RFC 2616) and URI (RFC 2396) RFCs for
guidance, here is what I found.

Section 2.1 of RFC 2396 (titled "URI and non-ASCII characters") states:
"The relationship between URI and characters has been a source of
confusion for characters that are not part of US-ASCII."

Both the HTTP and URI RFCs make a distinction between octets (i.e.
bytes) and characters.

To summarize what the RFCs say:

The URI must use a limited set of ASCII characters for portability and
other reasons.

The content of the URI may be any set of octets (not limited to the
ASCII subset of characters mentioned above).  

Octets that do not map to the allowable set of ASCII characters in the
URI must be encoded with the % hex hex notation.

As section 2.1 of RFC 2396 states, "There are two mappings, one from URI
characters to octets, and a second from octets to original characters:
URI character sequence->octet sequence->original character sequence". 
This means that RFC 2396 specifies how to convert the ASCII characters
of the URI into octets, but a separate conversion of the octets to their
original character sequence is necessary to get the resultant true set
of characters.  RFC 2396 also mentions using UTF-8 as the conversion
method from these octets to characters (but doesn't require UTF-8).
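The two mappings can be sketched in Java as two separate steps.  This
is an illustration only: the class and method names are hypothetical,
and it is not how any particular container implements decoding.  Note
that the first step needs no character set at all, while the second
gives different results depending on the one chosen:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TwoStepDecode {
    // Mapping 1: URI character sequence -> octet sequence (no charset involved).
    public static byte[] uriToOctets(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c != '%') {
                out.write(c); // unescaped URI characters are ASCII, one octet each
                continue;
            }
            if (i + 2 >= escaped.length()) {
                throw new IllegalArgumentException("truncated escape at index " + i);
            }
            int hi = Character.digit(escaped.charAt(i + 1), 16);
            int lo = Character.digit(escaped.charAt(i + 2), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("bad hex digits at index " + i);
            }
            out.write((hi << 4) | lo);
            i += 2; // skip the two hex digits just consumed
        }
        return out.toByteArray();
    }

    // Mapping 2: octet sequence -> original character sequence (charset required).
    public static String octetsToChars(byte[] octets, Charset cs) {
        return new String(octets, cs);
    }

    public static void main(String[] args) {
        byte[] octets = uriToOctets("%C3%A9");
        // Same two octets, two different answers:
        System.out.println(octetsToChars(octets, StandardCharsets.UTF_8).length());      // one character
        System.out.println(octetsToChars(octets, StandardCharsets.ISO_8859_1).length()); // two characters
    }
}
```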

The HTTP RFC is actually silent with regard to this character set issue
for the conversion of the octets to characters, and essentially bypasses
the issue by saying, "3.2.3 URI Comparison.  When comparing two URIs to
decide if they match or not, a client SHOULD use a case-sensitive
octet-by-octet comparison of the entire URIs."  By saying the comparison
of URIs is based on the octets (not characters), the question of what
character encoding is used to convert the octets to characters is
avoided.  (Perhaps the servlet spec should also have methods that return
byte[] values to allow octet-by-octet comparison as the RFC states.)

Since the servlet spec defines certain methods (e.g. getPathInfo()) to
return a string (and since there is no method to get the raw octets),
the character encoding issue comes into play with the servlet spec.


The servlets we provide in our product support multi-lingual file names
using Unicode.  We use the UTF-8 character encoding to convert the
Unicode characters into octets for the URI.  These octets are then
encoded into ASCII characters according to RFC 2396.  All of this is
completely in line with the RFCs.  So for example a Thai character in
Unicode is represented as three octets in the UTF-8 character encoding.
So if we have a file named "/mydocs/<a Thai character>.doc" and we want
to make a valid URI of this file it would appear as:
http://server/servlet/fileservlet/mydocs/%xx%yy%zz.doc

where the Unicode string "/mydocs/<a Thai character>.doc" is converted
to octets using the UTF-8 encoding, and this set of octets is then
encoded with the % hex hex notation to make a valid URI.
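As a rough sketch of that encoding step in Java: the helper below is
hypothetical (it is not our product code), it leaves only the RFC 2396
unreserved characters and "/" unescaped, and the Thai letter U+0E01 is
picked arbitrarily for illustration.

```java
import java.nio.charset.StandardCharsets;

final class PathEscaper {
    // RFC 2396 "mark" characters, plus '/' so path separators survive.
    private static final String SAFE = "-_.!~*'()/";

    // Unicode string -> UTF-8 octets -> % hex hex notation.
    public static String escapePath(String path) {
        StringBuilder sb = new StringBuilder();
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int o = b & 0xFF; // treat the byte as an unsigned octet
            boolean alnum = (o >= 'a' && o <= 'z')
                    || (o >= 'A' && o <= 'Z')
                    || (o >= '0' && o <= '9');
            if (alnum || SAFE.indexOf(o) >= 0) {
                sb.append((char) o);
            } else {
                sb.append('%').append(String.format("%02X", o));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0E01 is three octets in UTF-8: E0 B8 81.
        System.out.println(escapePath("/mydocs/\u0E01.doc"));
        // prints /mydocs/%E0%B8%81.doc
    }
}
```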

The problem arises when we receive one of these URIs from the servlet
container.  Since the servlet spec states that the getPathInfo() method
returns a string and the string must be decoded, the servlet container
must make some assumption on what character encoding to use to convert
the octets into a string.  

As long as the servlet spec clearly states what character encoding is
used to convert the octets of the URI into a string, I am then able to
undo that to get the raw octets back so that I can convert them to the
correct character encoding for my application (UTF-8).  If the servlet
spec isn't clear on this point and different servlet container
implementations use different character encodings, I have no way of
getting the correct characters in a portable application.  I need to
write servlet container specific code to do the correct thing.
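A sketch of the "undo" step, under the assumption that the container's
decoding charset is known (the class and method names are hypothetical,
and the container's behaviour is simulated here rather than taken from
any real implementation).  The round trip only works when that charset
maps every 8-bit value, as ISO-8859-1 does; with US-ASCII the octets
are already gone:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Redecode {
    // Turn the container's string back into octets using the charset the
    // container is assumed to have used, then decode those octets as UTF-8.
    public static String redecode(String fromContainer, Charset containerCharset) {
        byte[] octets = fromContainer.getBytes(containerCharset);
        return new String(octets, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 octets for U+0E01, as they would arrive after %-unescaping.
        byte[] raw = { (byte) 0xE0, (byte) 0xB8, (byte) 0x81 };

        // Simulated container that decodes with ISO-8859-1: recoverable.
        String latin1 = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(redecode(latin1, StandardCharsets.ISO_8859_1).equals("\u0E01"));

        // Simulated container that decodes with US-ASCII: not recoverable.
        String ascii = new String(raw, StandardCharsets.US_ASCII);
        System.out.println(redecode(ascii, StandardCharsets.US_ASCII).equals("\u0E01"));
    }
}
```

Without a guarantee in the spec, the value passed as containerCharset
has to be chosen per container, which is exactly the non-portable code
I would like to avoid.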

So I would suggest that the servlet spec be clarified to specifically
state which character encoding is used to convert the octets from the
request URI into characters (and this character set must map all 8-bit
values, thus the current statement of 'use ASCII' is insufficient).  I
would further suggest that this character encoding be UTF-8, since its
first 128 code points are the same as ASCII.  UTF-8 is also a better
choice for an international specification.

thanks,
--Barry
CTO Xythos Software Inc.

Re: Question on what character encoding to use for URI-decoding

Posted by se...@eng.sun.com.
Thank you for your feedback on the Servlet API. Your feedback will be
read by an engineer on the Java Servlet API Team and given serious
consideration. We will contact you directly if we have further
questions about your feedback.

----------------------------------------------------------------------

We do not perform sales or technical support from this address.  This
is worth repeating: you will not receive any additional mail, unless
we have questions on your feedback. Please contact one of our other
support channels (below) if you require support.

To place a bug report directly into our database, you may enter your
bug here: http://java.sun.com/cgi-bin/bugreport.cgi

For licensing, sales and schedule information, please contact
1-888-THEJAVA. If outside the US, please dial 1-(512)434-1591

----------------------------------------------------------------------

For more discussion of the servlet API please consider joining the
servlet-interest mailing list.

You may subscribe to the mailing list by sending an email to:

LISTSERV@JAVASOFT.COM

with the _body_ of the message containing the line

SUBSCRIBE SERVLET-INTEREST Full-Name-Here

where Full-Name-Here is your name.

Discussions of programming Java Servlets, and server side Java
programming in general, are carried out on the Usenet newsgroup
comp.lang.java.programmer.

----------------------------------------------------------------------

Thank you for your time and input.

The Servlet API Team