Posted to dev@httpd.apache.org by Graham Leggett <mi...@sharp.fm> on 2010/09/18 19:10:26 UTC

mod_include: echo, entity encoding and UTF-8

Hi all,

When the SSI tag below is handled, the value of the string output to  
the browser is entity encoded:

<!--#echo encoding="entity" var="MY_VAR"-->

This is done with a line that looks something like this:

/* PR#25202: escape anything non-ascii here */
echo_text = ap_escape_html2(ctx->dpool, val, 1);

The problem with the above is the parameter "1", which means that
non-ASCII characters are entity encoded as HTML escape sequences, and in
the process any UTF-8 encoded text that is not plain ASCII breaks.

What I propose we do is change the value for v2.3+ as follows:

echo_text = ap_escape_html2(ctx->dpool, val, 0);

This allows UTF-8 character sequences to be passed through unchanged.
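
To illustrate the difference between the two flag values, here is a minimal
standalone sketch. It assumes that a non-zero third argument causes every
non-ASCII byte to be escaped on its own as a numeric reference, and that
<, >, & and " are escaped in both modes. sketch_escape_html() is a
hypothetical stand-in written for this mail, not the httpd function, and it
uses malloc() rather than a pool.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ap_escape_html2(); NOT the httpd implementation.
 * Always escapes <, >, & and ". When toasc is non-zero, every byte above
 * 127 is additionally rewritten as a decimal numeric reference, one byte
 * at a time. Returns a malloc()ed string that the caller must free(). */
char *sketch_escape_html(const char *s, int toasc)
{
    size_t i, len;
    char *out, *p;

    assert(s != NULL);
    len = strlen(s);

    /* Worst case: each input byte grows to 6 characters ("&quot;", "&#195;"). */
    out = malloc(len * 6 + 1);
    if (out == NULL) {
        return NULL;
    }

    p = out;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];

        if (c == '<') {
            p += sprintf(p, "&lt;");
        } else if (c == '>') {
            p += sprintf(p, "&gt;");
        } else if (c == '&') {
            p += sprintf(p, "&amp;");
        } else if (c == '"') {
            p += sprintf(p, "&quot;");
        } else if (toasc && c > 127) {
            /* Byte-wise escape: this is what splits a UTF-8 sequence. */
            p += sprintf(p, "&#%3.3d;", c);
        } else {
            *p++ = (char)c;
        }
    }
    *p = '\0';
    return out;
}
```

Under these assumptions, the two-byte UTF-8 sequence 0xC3 0xA9 (e with an
acute accent) comes out as two separate references when the flag is 1, which
a browser shows as two unrelated characters. With the flag set to 0 the bytes
pass through untouched, while the markup characters are still escaped.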

Past discussion in PR#25202 seems to revolve around backwards  
compatibility, though with v2.4+ we have the power to change this  
behaviour.

Does allowing UTF-8 character sequences through introduce any cross
site scripting risk? I understand not, but would like to confirm.

Regards,
Graham
--


Re: mod_include: echo, entity encoding and UTF-8

Posted by Graham Leggett <mi...@sharp.fm>.
On 18 Sep 2010, at 7:10 PM, Graham Leggett wrote:

> When the SSI tag below is handled, the value of the string output to  
> the browser is entity encoded:
>
> <!--#echo encoding="entity" var="MY_VAR"-->
>
> This is done with a line that looks something like this:
>
> /* PR#25202: escape anything non-ascii here */
> echo_text = ap_escape_html2(ctx->dpool, val, 1);
>
> The problem with the above is the parameter "1", which means that  
> non-ASCII characters are entity encoded as html escape sequences,  
> and in the process anything encoded with UTF-8 (and is not ASCII)  
> breaks.

Looking further at PR25202, the fix for it caused the regression
described in PR47686, where UTF-8 support broke.

I've created a fix for this, where the "set" and "echo" SSI commands
have been taught to handle "encoding" and "decoding" parameters.

For both echo and set, the value is first decoded as specified by the
"decoding" parameter, and then encoded as specified by the "encoding"
parameter. This allows full control of the encoding and decoding of
variables and echoed parameters, depending on where they came from.

Encoding and decoding can contain multiple values, so that you can for  
example strip off urlencoding, then entity encoding before using a  
value, like this: decoding="url,entity".
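
As a rough illustration of how a list such as "url,entity" could be applied
in order, here is a standalone sketch. All function names are hypothetical
and this is not the code from the fix. It assumes the list is comma
separated with no spaces, recognises only the two names used above, decodes
only %XX sequences for "url", and decodes only the four basic entities for
"entity".

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* In-place URL decoding: "%XX" becomes the byte 0xXX; all else is copied. */
static void sketch_url_decode(char *s)
{
    char *w = s;

    while (*s) {
        if (s[0] == '%' && isxdigit((unsigned char)s[1])
                        && isxdigit((unsigned char)s[2])) {
            char hex[3] = { s[1], s[2], '\0' };
            *w++ = (char)strtol(hex, NULL, 16);
            s += 3;
        } else {
            *w++ = *s++;
        }
    }
    *w = '\0';
}

/* In-place entity decoding for &lt; &gt; &amp; and &quot; only. */
static void sketch_entity_decode(char *s)
{
    static const struct { const char *name; char ch; } map[] = {
        { "&lt;", '<' }, { "&gt;", '>' }, { "&amp;", '&' }, { "&quot;", '"' }
    };
    const size_t n = sizeof(map) / sizeof(map[0]);
    char *w = s;

    while (*s) {
        size_t i;
        for (i = 0; i < n; i++) {
            size_t len = strlen(map[i].name);
            if (strncmp(s, map[i].name, len) == 0) {
                *w++ = map[i].ch;
                s += len;
                break;
            }
        }
        if (i == n) {
            *w++ = *s++;   /* no entity matched at this position */
        }
    }
    *w = '\0';
}

/* Applies each decoder named in the comma separated list, left to right.
 * Returns 0 on success, or -1 when a name is not recognised. */
int sketch_apply_decoders(char *value, const char *list)
{
    assert(value != NULL && list != NULL);

    while (*list) {
        size_t n = strcspn(list, ",");

        if (n == 3 && strncmp(list, "url", 3) == 0) {
            sketch_url_decode(value);
        } else if (n == 6 && strncmp(list, "entity", 6) == 0) {
            sketch_entity_decode(value);
        } else {
            return -1;
        }
        list += n;
        if (*list == ',') {
            list++;
        }
    }
    return 0;
}
```

The order matters in this sketch: a value that was entity encoded and then
urlencoded only comes back to its original form with "url,entity", since the
entities are not visible until the urlencoding has been removed.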

Regards,
Graham
--