Posted to dev@sling.apache.org by "Felix Meschberger (JIRA)" <ji...@apache.org> on 2008/06/05 14:55:45 UTC

[jira] Commented: (SLING-508) Parameter decoding uses wrong default charset

    [ https://issues.apache.org/jira/browse/SLING-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602634#action_12602634 ] 

Felix Meschberger commented on SLING-508:
-----------------------------------------

First off: Servlet Container parameters are not re-encoded by Sling (any more). They are taken as is.

Now, to what happens here:

On the one hand, the W3C [1] recommends that browser vendors encode non-ASCII characters in URLs as UTF-8. This should IMO also cover parameters POSTed as application/x-www-form-urlencoded, although I could not find a real codification of this.

On the other hand, the Servlet Specification states that all data read from POSTed content should be decoded as ISO-8859-1 (Servlet API 2.4, Section 4.9). As servlet containers only read application/x-www-form-urlencoded POST requests, this issue is about these parameters.

Third, servlet containers are implemented inconsistently: some (e.g. Tomcat) follow the Servlet API spec and read the data as ISO-8859-1, and some (e.g. Jetty) follow the W3C recommendation and read the data as UTF-8.

Fourth, browsers do not follow the W3C recommendation but instead encode the parameters in the character encoding of the page on which the form is placed.

Consider now a Servlet API conforming servlet container accepting form data from a UTF-8 encoded page: the parameters are encoded in UTF-8 and the servlet container decodes them as ISO-8859-1, giving unreadable data. Conversely, in a W3C conforming container accepting form data from an ISO-8859-1 encoded page, the data will also be corrupted, because ISO-8859-1 data is decoded as UTF-8.
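Both failure modes can be reproduced outside a servlet container. The following is only a minimal sketch: the class and method names are made up for illustration and are not part of Sling or the Servlet API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not Sling code.
class CharsetMismatchDemo {

    // Encode `text` the way the browser would (encodedAs) and decode the
    // resulting bytes the way the container would (decodedAs).
    static String misdecode(String text, Charset encodedAs, Charset decodedAs) {
        return new String(text.getBytes(encodedAs), decodedAs);
    }

    // Render a string as a list of code points so the result does not
    // depend on the console encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("U+%04X ", (int) s.charAt(i)));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String original = "\u00e4"; // a-umlaut

        // UTF-8 page, ISO-8859-1 container: the two UTF-8 bytes 0xC3 0xA4
        // become two separate characters.
        String specContainer = misdecode(original,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        // ISO-8859-1 page, UTF-8 container: the single byte 0xE4 is not
        // valid UTF-8 on its own, so Java substitutes a replacement character.
        String w3cContainer = misdecode(original,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        System.out.println(codePoints(specContainer));
        System.out.println(codePoints(w3cContainer));
    }
}
```

Note that the first case is still reversible (the original bytes survive), while the second is not: once the replacement character has been substituted, the original byte is lost.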

To work around this, we have very little power. The best we can do is try to force the servlet container into decoding the parameter data as ISO-8859-1 and then recode the raw data in whatever character encoding has been declared with the "_charset_" request parameter.
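The recoding step could look roughly like the sketch below. This is not the actual Sling implementation: the class name, the method name and the fallback behaviour for a missing or unknown charset are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the code in Sling's Util.java.
class ParameterRecoder {

    // `isoDecoded` is a parameter value the container decoded as ISO-8859-1.
    // `declaredCharset` is the value of the "_charset_" request parameter.
    static String recode(String isoDecoded, String declaredCharset) {
        if (isoDecoded == null || declaredCharset == null) {
            return isoDecoded; // assumption: nothing declared, leave as is
        }
        Charset target;
        try {
            target = Charset.forName(declaredCharset);
        } catch (IllegalArgumentException e) {
            // Covers illegal and unsupported charset names.
            // Assumption: fall back to the container's decoding.
            return isoDecoded;
        }
        // ISO-8859-1 maps each character back to the original raw byte.
        byte[] raw = isoDecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Decode the raw bytes in the charset the form declared.
        return new String(raw, target);
    }

    public static void main(String[] args) {
        // U+00C3 U+00A4 is what an ISO-8859-1 decoding container produces
        // for a UTF-8 encoded a-umlaut.
        String fixed = recode("\u00c3\u00a4", "UTF-8");
        System.out.println(fixed.equals("\u00e4")); // prints "true"
    }
}
```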

Two remarks:
(1) We use ISO-8859-1 because this encoding defines a 1:1 mapping of raw bytes to characters. In fact, the first 256 characters of Unicode are exactly the characters of the ISO-8859-1 encoding. Thus ISO-8859-1 is kind of an identity encoding.
(2) "Trying to force" the container means that we ensure the correct character set is used for reading the input, but if the input has already been read (e.g. by a filter outside Sling), we cannot do much any more. This is probably not much of an issue, but we must be aware of it.
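The identity property from remark (1) can be checked directly in Java. The class below is a hypothetical sketch written for this purpose, not Sling code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical check class, not Sling code.
class Latin1IdentityCheck {

    // Returns true if every byte value 0..255 decodes to the character with
    // the same numeric value and encodes back to the same byte.
    static boolean isIdentity() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        if (decoded.length() != 256) {
            return false;
        }
        for (int i = 0; i < 256; i++) {
            if (decoded.charAt(i) != i) {
                return false;
            }
        }
        byte[] back = decoded.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(all, back);
    }

    public static void main(String[] args) {
        System.out.println(isIdentity());
    }
}
```

As for remark (2), the Servlet API call for declaring the request encoding is ServletRequest.setCharacterEncoding, which only has an effect if it is called before any parameter or the request body has been read.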


[1] http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

> Parameter decoding uses wrong default charset
> ---------------------------------------------
>
>                 Key: SLING-508
>                 URL: https://issues.apache.org/jira/browse/SLING-508
>             Project: Sling
>          Issue Type: Bug
>          Components: Engine
>    Affects Versions: 2.0.0
>            Reporter: Tobias Bocanegra
>            Assignee: Felix Meschberger
>            Priority: Blocker
>
> As of SLING-152 the request parameters are re-encoded if a _charset_ parameter is present. It assumes that the default encoding is
> UTF-8, which is not the case for servlet spec compliant containers (e.g. Tomcat).
> Change the default encoding to ISO-8859-1 or make it configurable.
> see: http://svn.apache.org/viewvc/incubator/sling/trunk/engine/src/main/java/org/apache/sling/engine/impl/parameters/Util.java?view=markup
