You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tomcat.apache.org by Christopher Schultz <ch...@christopherschultz.net> on 2021/04/01 16:08:55 UTC

[PROPOSAL] Explicitly-set the request character encoding when it has been committed

All,

Yesterday, I struggled to determine why my application was behaving 
differently than it had been in the past, and the problem turned out to 
be that I had inserted a <filter> in the filter-chain before my 
CharacterEncodingFilter. My new <filter> was reading a 
request-parameter. The CharacterEncodingFilter looks like this:

     public void doFilter(ServletRequest request,
			 ServletResponse response,
			 FilterChain chain)
	throws IOException, ServletException
     {
	request.setCharacterEncoding(getCharacterEncoding(request));

	chain.doFilter(request, response);
     }

     protected String getCharacterEncoding(ServletRequest request)
     {
	String charset=request.getCharacterEncoding();

	if(null == charset)
	    return this.getDefaultEncoding();
	else
	    return charset;
     }

The "default encoding" is essentially always UTF-8.

When looking at the request in my servlet, the character encoding 
reported by the request was "UTF-8", but the actual encoding used 
appeared to be ISO-8859-1 (the protocol default).

It looks like Tomcat is defaulting to ISO-8859-1 but continuing to 
return null for request.getCharacterEncoding().

My proposal is to have Tomcat set the request encoding field to 
"ISO-8859-1" in the following situation:

1. The character encoding is null
2. A method is called which requires that the character encoding be 
"committed"

Once that charset is determined, changing the request's charset has no 
effect whatsoever other than to confuse the application developer.

If Tomcat were to explicitly-set that encoding, my 
CharacterEncodingFilter wouldn't detect null and override that request 
charset with "UTF-8", thereby lying to the rest of the application.

I might even go so far as to propose that calling 
request.setCharacterEncoding() after the encoding has been committed 
should throw IllegalStateException. (This may be a violation of the 
spec, as the javadoc only declares UnsupportedEncodingException).

The javadoc state:

"
Overrides the name of the character encoding used in the body of this 
request. This method must be called prior to reading request parameters 
or reading input using getReader(). *Otherwise, it has no effect.*
"

(emphasis mine)

In Tomcat, if you call setCharacterEncoding("UTF-8") after request 
parameters have been read, there *is* an effect: you change the value of 
the charset field in the response.

This is the current implementation of setCharacterEncoding:

     /**
      * Overrides the name of the character encoding used in the body of
      * this request.  This method must be called prior to reading request
      * parameters or reading input using <code>getReader()</code>.
      *
      * @param enc The character encoding to be used
      *
      * @exception UnsupportedEncodingException if the specified encoding
      *  is not supported
      *
      * @since Servlet 2.3
      */
     public void setCharacterEncoding(String enc) throws 
UnsupportedEncodingException {

         if (usingReader) {
             return;
         }

         // Confirm that the encoding name is valid
         Charset charset = B2CConverter.getCharset(enc);

         // Save the validated encoding
         coyoteRequest.setCharset(charset);
     }

The javadoc says that it must be called before reading any request 
parameters OR calling getReader() but there is only a check for the reader.

Maybe we should change the check to:

         if (usingReader || parametersParsed) {
             return;
         }

And also, change Request.parseParameters to add:

             // getCharacterEncoding() may have been overridden to 
search for
             // hidden form field containing request encoding
             Charset charset = getCharset();

             // Add this line, here:
             coyoteRequest.setCharset(charset);

It is at this point that the character set is truly committed, at least 
when parsing parameters.

In getReader, we have similar logic, where we set the character set for 
the request after determining what it actually is:

         if (coyoteRequest.getCharacterEncoding() == null) {
             // Nothing currently set explicitly.
             // Check the content
             Context context = getContext();
             if (context != null) {
                 String enc = context.getRequestCharacterEncoding();
                 if (enc != null) {
                     // Explicitly set the context default so it is 
visible to
                     // InputBuffer when creating the Reader.
                     setCharacterEncoding(enc);
                 }
             }
         }

I'm open to any or all of the above, but I think something should be 
done. It was surprising to see that the request's charset was "UTF-8" 
and yet the UTF-8 bytes sent to the container were coming out mangled in 
the application.

If the application were to see null coming back from 
request.getCharacterEncoding() would have at least giving a clue as to 
what was happening. Since the effective charset being used was 
ISO-8859-1, it would have been better to return "ISO-8859-1" instead of 
"null" knowing that the parameters had *already* been interpreted using 
that character set.

Thanks,
-chris

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org


Re: [PROPOSAL] Explicitly-set the request character encoding when it has been committed

Posted by Mark Thomas <ma...@apache.org>.
On 05/04/2021 13:19, Christopher Schultz wrote:

<snip/>

> I'm happy with incremental changes. Making the first change will be an 
> improvement for sure.
> 
> On the other hand, if code calls request.setCharacterEncoding(non-null) 
> then it will overwrite whatever the request had previously sent 
> (including null). So in that case, request.getCharacterEncoding isn't 
> always returning "what the client sent".

Another place where there is scope to add further clarity to the spec.

Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org


Re: [PROPOSAL] Explicitly-set the request character encoding when it has been committed

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Mark,

On 4/1/21 16:00, Mark Thomas wrote:
> On 01/04/2021 17:08, Christopher Schultz wrote:
> 
> <snip/>
> 
>> The javadoc says that it must be called before reading any request 
>> parameters OR calling getReader() but there is only a check for the 
>> reader.
>>
>> Maybe we should change the check to:
>>
>>          if (usingReader || parametersParsed) {
>>              return;
>>          }
> 
> +1
> 
>> And also, change Request.parseParameters to add:
>>
>>              // getCharacterEncoding() may have been overridden to 
>> search for
>>              // hidden form field containing request encoding
>>              Charset charset = getCharset();
>>
>>              // Add this line, here:
>>              coyoteRequest.setCharset(charset);
> 
> I'm less sure of this because of this line in the Javadoc for 
> getCharacterEncoding():
> 
> @return ... <code>null</code> if the request does not specify a 
> character encoding
> 
>> It is at this point that the character set is truly committed, at 
>> least when parsing parameters.
>>
>> In getReader, we have similar logic, where we set the character set 
>> for the request after determining what it actually is:
> 
> Which suggests the second of the above changes might be OK even if the 
> spec suggests otherwise.
> 
> At the moment, I'm leaning towards just the first of your proposed 
> changes but I'm open to arguments to do more.

I'm happy with incremental changes. Making the first change will be an 
improvement for sure.

On the other hand, if code calls request.setCharacterEncoding(non-null) 
then it will overwrite whatever the request had previously sent 
(including null). So in that case, request.getCharacterEncoding isn't 
always returning "what the client sent".

-chris

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org


Re: [PROPOSAL] Explicitly-set the request character encoding when it has been committed

Posted by Mark Thomas <ma...@apache.org>.
On 01/04/2021 17:08, Christopher Schultz wrote:

<snip/>

> The javadoc says that it must be called before reading any request 
> parameters OR calling getReader() but there is only a check for the reader.
> 
> Maybe we should change the check to:
> 
>          if (usingReader || parametersParsed) {
>              return;
>          }

+1

> And also, change Request.parseParameters to add:
> 
>              // getCharacterEncoding() may have been overridden to 
> search for
>              // hidden form field containing request encoding
>              Charset charset = getCharset();
> 
>              // Add this line, here:
>              coyoteRequest.setCharset(charset);

I'm less sure of this because of this line in the Javadoc for 
getCharacterEncoding():

@return ... <code>null</code> if the request does not specify a 
character encoding

> It is at this point that the character set is truly committed, at least 
> when parsing parameters.
> 
> In getReader, we have similar logic, where we set the character set for 
> the request after determining what it actually is:

Which suggests the second of the above changes might be OK even if the 
spec suggests otherwise.

At the moment, I'm leaning towards just the first of your proposed 
changes but I'm open to arguments to do more.

Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org