Posted to dev@httpd.apache.org by Micha Lenk <mi...@lenk.info> on 2018/05/25 16:43:47 UTC

mod_proxy_html and special characters

Hi all,

I'm currently facing an issue where the directive ProxyHTMLURLMap does 
not work. I am not sure whether that is by design or not, and I would 
appreciate some feedback.

Let's assume an imaginary backend server delivers an HTML page that 
contains a link like this:

<a href="http://internal/!%22%23$/">A link with special characters</a>

Please note that %22 is the double quote that needs to be encoded to not 
break the HTML, and %23 is the '#' character, which we don't want to get 
treated as anchor in this case. So, the unencoded URL would look like this:

http://internal/!"#$/

Now, Apache configured as reverse proxy should rewrite this link to 
http://external/!"#$/ (or http://external/!%22%23$/), but not any other 
links outside the sub directory /!"#$/ (nor /!%22%23$/). An imaginary 
configuration to achieve that and to showcase the issue I am trying to 
get feedback on looks like this:

ProxyHTMLURLMap "http://internal/!\"#$/" "http://external/!\"#$/"

Please note that the double quote is only escaped here with a backslash 
to cater for the Apache configuration syntax requirements. This does not 
work, i.e. the URL in the HTML document doesn't get rewritten.

Let's try to better understand what exactly is happening here. Looking 
into the code of mod_proxy_html.c (trunk, SVN rev. 1832252), this is 
where the string comparison happens:

  524              s_from = strlen(m->from.c);
  525              if (!strncasecmp(ctx->buf, m->from.c, s_from)) {
  ...                  ... do the string replacement ...


... where ctx->buf is the URL found in the HTML document, and m->from.c 
is the first configured argument of ProxyHTMLURLMap. So, if the latter 
is a prefix of the former, this condition should be true and the string 
replacement should happen. When the expected string replacement doesn't 
happen, the condition is false and the values of the variables are:

ctx->buf  = http://internal/!%22%23$/
m->from.c = http://internal/!"#$/

So, the strings don't match and are not replaced for that reason.
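
The mismatch can be reproduced outside httpd. The following is a minimal
standalone sketch of the same prefix test, not the module's code: the helper
name has_prefix_nocase is made up for illustration, and only the
strncasecmp/strlen pattern is taken from the excerpt above.

```c
#include <string.h>
#include <strings.h>  /* strncasecmp (POSIX) */

/* Hypothetical helper mirroring the comparison quoted above:
 * returns 1 if `from` is a case-insensitive prefix of `url`, else 0. */
int has_prefix_nocase(const char *url, const char *from)
{
    size_t s_from = strlen(from);
    return strncasecmp(url, from, s_from) == 0;
}

/* Example values from this post:
 *   url  = "http://internal/!%22%23$/"   (as found in the HTML)
 *   from = "http://internal/!\"#$/"      (as configured)
 * The comparison stops at the first differing byte, '%' versus '"'. */
```

With the values from this post the helper returns 0, because the byte after
"http://internal/!" is '%' in the document and '"' in the configured pattern.
It returns 1 only when the pattern is written in the same encoded form.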

Going forward, I am not interested in finding a workaround for this, but 
rather in how to approach a fix (if this is a bug at all).

Is it reasonable to expect mod_proxy_html to rewrite URL encoded URLs as 
well?

Let's assume this needs to be fixed. To make the strings match, we could 
either URL-escape the value from the Apache directive ProxyHTMLURLMap, 
or temporarily URL-decode the string found in the HTML document just 
for the purpose of the string comparison. What is the right thing to do?
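
To make the second option concrete, here is a rough sketch of decoding into a
temporary buffer and comparing against that. Everything in it is an
assumption for illustration: the names url_decode_copy and
decoded_prefix_match, the fixed 1024-byte buffer, and the choice to leave
malformed escapes and %00 untouched. It is not taken from mod_proxy_html.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>  /* strncasecmp (POSIX) */

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = (unsigned char)tolower(c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Percent-decode src into dst (dst_len bytes, NUL-terminated on success).
 * Malformed escapes and %00 are copied through unchanged.
 * Returns 0 on success, -1 if dst is too small. */
int url_decode_copy(char *dst, size_t dst_len, const char *src)
{
    size_t o = 0;
    if (dst_len == 0) return -1;
    while (*src) {
        if (o + 1 >= dst_len) return -1;
        if (src[0] == '%') {
            int hi = hex_val((unsigned char)src[1]);
            int lo = hi >= 0 ? hex_val((unsigned char)src[2]) : -1;
            if (lo >= 0 && (hi * 16 + lo) != 0) {
                dst[o++] = (char)(hi * 16 + lo);
                src += 3;
                continue;
            }
        }
        dst[o++] = *src++;
    }
    dst[o] = '\0';
    return 0;
}

/* Returns 1 if `from` is a case-insensitive prefix of the decoded `url`.
 * URLs that do not fit the temporary buffer are treated as non-matching. */
int decoded_prefix_match(const char *url, const char *from)
{
    char tmp[1024];  /* fixed size for the sketch only */
    if (url_decode_copy(tmp, sizeof tmp, url) != 0) return 0;
    return strncasecmp(tmp, from, strlen(from)) == 0;
}
```

With this, decoded_prefix_match("http://internal/!%22%23$/",
"http://internal/!\"#$/") returns 1. Two caveats: a match on the decoded
copy no longer tells you how many bytes of the original buffer to replace,
since strlen(from) counts decoded bytes, and decoding everything also makes
%2F compare equal to '/', which changes which URLs match.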

If you have managed to read all this down to this line, I am curious about 
your feedback. :)

Regards,
Micha

Re: mod_proxy_html and special characters

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Fri, May 25, 2018 at 11:57 AM, Eric Covener <co...@gmail.com> wrote:

> > <a href="http://internal/!%22%23$/">A link with special characters</a>
>
> > ProxyHTMLURLMap "http://internal/!\"#$/" "http://external/!\"#$/"
>
> > Is it reasonable to expect mod_proxy_html to rewrite URL encoded URLs as
> > well?
>
> IMO no, I don't think the literals in the first argument should be
> expected to match the URL-encoded content
>

Agreed that the pattern above should only match and pass (or reflect,
in a rewrite case) a literal '#' for a fragment. If you mean %23, don't
write it as '#'.

The %-enc should be retained, and matched distinctly, unless their
plaintext is equivalent, e.g. meets none of the sub-delim or delim or
restricted set. Which must therefore include %25, % encoded '%' itself.
Any %41 or 'A' are equivalent because their definition is an identity.
But I don't know that you can use %41 in the match pattern as we
would not decode that, and you likely can force any result to contain
a %41.
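
As a rough illustration of that distinction, the sketch below normalizes a
string the way RFC 3986 describes for comparison purposes: escapes of
unreserved characters (ALPHA, DIGIT, '-', '.', '_', '~') are decoded, while
every other escape is kept and only has its hex digits uppercased. The
function names are made up for this example; nothing like it is claimed to
exist in httpd.

```c
#include <ctype.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = (unsigned char)tolower(c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" */
static int is_unreserved(int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

/* Hypothetical helper: normalize s in place.
 * %41 becomes 'A'; %22, %23, %25, %2f stay encoded (as %22, %23, %25, %2F).
 * Malformed escapes are copied through unchanged. */
void normalize_unreserved(char *s)
{
    char *out = s;  /* out never gets ahead of s, so in-place is safe */
    while (*s) {
        if (s[0] == '%') {
            int hi = hex_val((unsigned char)s[1]);
            int lo = hi >= 0 ? hex_val((unsigned char)s[2]) : -1;
            if (lo >= 0) {
                int v = hi * 16 + lo;
                if (is_unreserved(v)) {
                    *out++ = (char)v;
                } else {
                    *out++ = '%';
                    *out++ = (char)toupper((unsigned char)s[1]);
                    *out++ = (char)toupper((unsigned char)s[2]);
                }
                s += 3;
                continue;
            }
        }
        *out++ = *s++;
    }
    *out = '\0';
}
```

Applied to both the pattern and the document URL before comparing, this
would make %41 and 'A' match while still keeping %23 distinct from '#' and
%25 distinct from '%'.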

This is not well handled in general, there are ideas floating around,
but since there is no committee interest beyond 2.4.x and complete
division of opinion on how anything >2.4.x would be managed, it
looks most practical to clearly document existing observed behavior.

Re: mod_proxy_html and special characters

Posted by Nick Kew <ni...@apache.org>.
> On 28 May 2018, at 08:50, Micha Lenk <mi...@lenk.info> wrote:
> 
> The reason I am asking this is, because for Location matching, Apache httpd apparently does map a request with a URL encoded path to the non-encoded configured path. For example, if I have configured in a virtual host:

Yes of course httpd deals with encoding, as it must, in processing a request URL.

>  <Location "/!\"#$/">
>    ProxyPass "http://internal/!\"#$/"
>    ProxyHTMLURLMap "http://internal/!\"#$/" "http://external/!\"#$/"
>    ...
>  </Location>

mod_proxy_html is not processing a request URL, it's processing contents
in the response.  Contents destined, and encoded, for an HTTP client.
The resemblance is entirely coincidental.  To align the behaviour
on grounds of consistency would seem to me misleading!

-- 
Nick Kew

Re: mod_proxy_html and special characters

Posted by Micha Lenk <mi...@lenk.info>.
Hi Eric,

On 05/25/2018 06:57 PM, Eric Covener wrote:
>> <a href="http://internal/!%22%23$/">A link with special characters</a>
>>
>> ProxyHTMLURLMap "http://internal/!\"#$/" "http://external/!\"#$/"
>>
>> Is it reasonable to expect mod_proxy_html to rewrite URL encoded URLs as
>> well?
>
> IMO no, I don't think the literals in the first argument should be
> expected to match the URL-encoded content

The reason I am asking this is, because for Location matching, Apache 
httpd apparently does map a request with a URL encoded path to the 
non-encoded configured path. For example, if I have configured in a 
virtual host:

   <Location "/!\"#$/">
     ProxyPass "http://internal/!\"#$/"
     ProxyHTMLURLMap "http://internal/!\"#$/" "http://external/!\"#$/"
     ...
   </Location>

... then for matching the location container it does not matter whether 
the path of the request is URL encoded or not.

I consider this behavior a bit inconsistent. URL-encoded requests get 
proxied to the internal resource as if they were not URL-encoded. But 
URL-encoding a few characters in the path is sufficient to bypass HTML 
rewriting.

Regards,
Micha

Re: mod_proxy_html and special characters

Posted by Eric Covener <co...@gmail.com>.
> <a href="http://internal/!%22%23$/">A link with special characters</a>

> ProxyHTMLURLMap "http://internal/!\"#$/" "http://external/!\"#$/"

> Is it reasonable to expect mod_proxy_html to rewrite URL encoded URLs as
> well?

IMO no, I don't think the literals in the first argument should be
expected to match the URL-encoded content

Re: mod_proxy_html and special characters

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
For clarity's sake, the spec defines these two entities as not-equal.

Of course, %41 and 'A' are equivalent, so such a function might not be a
bad thing to have in refactoring URI handling.

On Mon, May 28, 2018, 04:10 Nick Kew <ni...@apache.org> wrote:

>
> >> ctx->buf  = http://internal/!%22%23$/
> >> m->from.c = http://internal/!"#$/
>
> A further thought arising from that.
>
> Just as strcasecmp is case-independent, the world could no doubt use
> a standard library function that would treat the above as equal.
>
> Something like
> int stringcmp(const char *a, const char *b, unsigned int flags)
> where flags would control behaviour such as case-independence,
> and equivalence over URLencoding, HTML encoding, HTML entities,
> and whatever else someone might like to support (maybe integrate
> with locale too?).
>
> Anyone know of such a thing?
>
> --
> Nick Kew
>

Re: mod_proxy_html and special characters

Posted by Nick Kew <ni...@apache.org>.
>> ctx->buf  = http://internal/!%22%23$/
>> m->from.c = http://internal/!"#$/

A further thought arising from that.

Just as strcasecmp is case-independent, the world could no doubt use
a standard library function that would treat the above as equal.

Something like
int stringcmp(const char *a, const char *b, unsigned int flags)
where flags would control behaviour such as case-independence,
and equivalence over URLencoding, HTML encoding, HTML entities,
and whatever else someone might like to support (maybe integrate
with locale too?).

Anyone know of such a thing?
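
For the sake of discussion, a minimal sketch of what such a function could
look like, limited to two of the behaviours mentioned. The flag names
STRINGCMP_NOCASE and STRINGCMP_URLENC are invented here, and HTML entities
and locale handling are left out entirely.

```c
#include <ctype.h>

#define STRINGCMP_NOCASE 0x1u  /* hypothetical flag: ignore ASCII case */
#define STRINGCMP_URLENC 0x2u  /* hypothetical flag: %XX equals the byte it encodes */

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = (unsigned char)tolower(c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Read one logical character from *p and advance *p past it.
 * Malformed escapes are read as a literal '%'. */
static int next_char(const char **p, unsigned int flags)
{
    const unsigned char *s = (const unsigned char *)*p;
    int c = s[0];
    int len = 1;
    if ((flags & STRINGCMP_URLENC) && c == '%') {
        int hi = hex_val(s[1]);
        int lo = hi >= 0 ? hex_val(s[2]) : -1;
        if (lo >= 0) {
            c = hi * 16 + lo;
            len = 3;
        }
    }
    if (flags & STRINGCMP_NOCASE)
        c = tolower(c);
    *p += len;
    return c;
}

/* Returns 0 if a and b are equal under the given flags, non-zero otherwise
 * (negative or positive according to the first differing character). */
int stringcmp(const char *a, const char *b, unsigned int flags)
{
    while (*a && *b) {
        int ca = next_char(&a, flags);
        int cb = next_char(&b, flags);
        if (ca != cb)
            return ca - cb;
    }
    return (unsigned char)*a - (unsigned char)*b;
}
```

With STRINGCMP_URLENC set, the two strings above compare equal; with flags
of 0 they do not. Note that this flag decodes every escape, so it also
equates an encoded reserved character such as %23 with its literal, which
RFC 3986 does not treat as equivalent.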

-- 
Nick Kew

Re: mod_proxy_html and special characters

Posted by Nick Kew <ni...@apache.org>.
> On 25 May 2018, at 17:43, Micha Lenk <mi...@lenk.info> wrote:
> 
> 524              s_from = strlen(m->from.c);
> 525              if (!strncasecmp(ctx->buf, m->from.c, s_from)) {
> ...                  ... do the string replacement ...
> 
> 
> ... where ctx->buf is the URL found in the HTML document, and m->from.c is the first configured argument of ProxyHTMLURLMap. So, if the latter is a prefix of the first, this condition should be true and the string replacement should happen. When the expected string replacement doesn't happen, the condition is false and the values of the variables are:
> 
> ctx->buf  = http://internal/!%22%23$/
> m->from.c = http://internal/!"#$/
> 
> So, the strings don't match and are not replaced for that reason.

Yep.  mod_proxy_html takes what it sees.  That's why it relies on another module
(mod_xml2enc) for i18n, which is kind-of what I expected to see from your
subject line!

> Going forward I am not interested in finding a work around for this, but more how to approach a fix (if this is a bug at all).
> 
> Is it reasonable to expect mod_proxy_html to rewrite URL encoded URLs as well?

I think it's reasonable to use the escaped html in your ProxyHTMLURLMap.
If we have mod_proxy_html unescape characters, it adds complexity to the code,
and (perhaps more to the point) presents a mirror-image of your problem to
anyone with the opposite expectations.
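
For the example in this thread, that would mean writing the map in the same
encoded form the backend emits. An untested fragment, assuming the backend
always encodes exactly these two characters and always in this form:

```apache
ProxyHTMLURLMap "http://internal/!%22%23$/" "http://external/!%22%23$/"
```

Since the comparison quoted above uses strncasecmp, the case of the hex
digits in the escapes should not matter for the match.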

> Let's assume this needs to be fixed. To make the strings match, we could either URL escape the value from the Apache directive ProxyHTMLURLMap, or URL temporarily URL-decode the string found in the HTML document just for the purpose of the string comparison. What is the right thing to do?

I prefer to leave it to server admins to find the match that works for them.
I don't recollect this particular question ever arising in 15 years, which kind-of
suggests users are not confused by it!

-- 
Nick Kew