Posted to dev@httpd.apache.org by William A Rowe Jr <wr...@rowe-clan.net> on 2016/09/12 15:49:47 UTC

StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

On Mon, Aug 29, 2016 at 1:04 PM, Ruediger Pluem <rp...@apache.org> wrote:

>
> On 08/29/2016 06:25 PM, William A Rowe Jr wrote:
> > Thanks all for the feedback. Status and follow-up questions inline
> >
> > On Thu, Aug 25, 2016 at 10:02 PM, William A Rowe Jr <wrowe@rowe-clan.net
> <ma...@rowe-clan.net>> wrote:
> >
> >     4. Should the next 2.4/2.2 releases default to Strict[URI] at all?
> >
> >     Real world direct observation especially appreciated from actual
> deployments.
> >
> > Strict (and StrictURI) remain the default.
>
> StrictURI as a default only makes sense if we have our own house in order
> (see above), otherwise it should be opt in.


So it's not only our house [our %3B encoding in httpd isn't a showstopper
here]... but also whether widely used user-agent browsers and tooling have
their houses in order, so I started to study the current browser behaviors.
The applicable spec is https://tools.ietf.org/html/rfc3986#section-3.3

Checked the unreserved set with '?' and '/' observing special meanings.
Nothing here should become escaped when given as a URI;
http://localhost:8080/unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query

Checked the invalid set of characters, all of which must be encoded
per the spec, and verified that #frag is not passed to the server;
http://localhost:8080/gen-delims-[]/invalid- "<>\^`{|}#frag

Checked the reserved set including '#' '%' '?' by their encoded value
to determine if there are any unpleasant reverse surprises lurking;
http://localhost:8080/encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D

Checked a list of unreserved/unassigned gen-delims and sub-delims
to determine if the user agent normalizes while composing the request;
http://localhost:8080/plain-%21%24%26%27%28%29%2A%2B%2C%2D%2E%31%32%33%41%42%43%5F%61%62%63%7E
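
For reference, here is a minimal sketch of the character classes those
four URLs probe. Illustrative C only, not httpd source; per RFC3986 a
conforming user-agent must %XX-encode any octet outside these classes.

    #include <ctype.h>
    #include <string.h>

    static int is_unreserved(int c)        /* RFC3986 section 2.3 */
    {
        return c && (isalnum(c) || strchr("-._~", c) != NULL);
    }

    static int is_sub_delim(int c)         /* RFC3986 section 2.2 */
    {
        return c && strchr("!$&'()*+,;=", c) != NULL;
    }

    static int is_gen_delim(int c)         /* RFC3986 section 2.2 */
    {
        return c && strchr(":/?#[]@", c) != NULL;
    }

    /* Everything else (space, '"', '<', '>', '\', '^', '`', '{', '|',
     * '}', controls, 8-bit octets) is invalid as a literal and may
     * appear only inside a %XX pct-encoded triplet. */
    static int is_uri_char(int c)
    {
        return c == '%' || is_unreserved(c) || is_sub_delim(c)
               || is_gen_delim(c);
    }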

Using the simplistic $ nc -kl localhost 8080, here are the results I
obtained from a couple of current browsers. More observations and feedback
on other user-agents to this list would be appreciated.

Chrome 53:
GET /unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query HTTP/1.1
GET /gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B%7C%7D HTTP/1.1
odd>            ^^                     ^
GET /encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D HTTP/1.1
GET /plain-%21%24%26%27%28%29%2A%2B%2C-.123ABC_abc~ HTTP/1.1
odd>        ^  ^  ^  ^  ^  ^  ^  ^  ^

Firefox 48:
GET /unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query HTTP/1.1
GET /gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B|%7D HTTP/1.1
odd>            ^^                     ^         ^
GET /encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D HTTP/1.1
GET /plain-%21%24%26%27%28%29%2A%2B%2C%2D%2E%31%32%33%41%42%43%5F%61%62%63%7E HTTP/1.1
odd>        ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^

The character '\' is converted to a '/' by both browsers, in a nod either
to Microsoft insanity, or a less-accessible '/' key. (Which suggests that
the yen sign might be treated similarly in some jp locales.) Invalid as a
literal '\' character, both browsers support an explicit %5C for those who
really want to use that in a URI. No actual issue here.

Interestingly, gen-delims '@' and ':' are explicitly allowed by 3.3 grammar
(as I've tested above), while '[' and ']' are omitted and therefore not
allowed according to spec. (On this, StrictURI won't care yet, because we
are simply correcting for any valid URI character, not by section, and
'[' ']' are obviously allowed for the IPv6 port specification - so we
don't reject yet.) When we add strict parsing to the apr uri parsing
function, we will trip over this, from all browsers, in spite of these
being prohibited and declared unwise for the past 18 years or more.

The character '|' is also invalid. However, Firefox fails to follow the spec
again here (although Chrome gets it right).

With respect to these characters, recall this 18 year old document,
last paragraph describes the rationale;
https://tools.ietf.org/html/rfc2396.html#section-2.4.3

   unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

   Data corresponding to excluded characters must be escaped in order to
   be properly represented within a URI.


Which replaced https://tools.ietf.org/html/rfc1738#section-2.2 now
almost 22 years old, without changing the rules;

   Unsafe:

   Characters can be unsafe for a number of reasons.  The space
   character is unsafe because significant spaces may disappear and
   insignificant spaces may be introduced when URLs are transcribed or
   typeset or subjected to the treatment of word-processing programs.
   The characters "<" and ">" are unsafe because they are used as the
   delimiters around URLs in free text; the quote mark (""") is used to
   delimit URLs in some systems.  The character "#" is unsafe and should
   always be encoded because it is used in World Wide Web and in other
   systems to delimit a URL from a fragment/anchor identifier that might
   follow it.  The character "%" is unsafe because it is used for
   encodings of other characters.  Other characters are unsafe because
   gateways and other transport agents are known to sometimes modify
   such characters. These characters are "{", "}", "|", "\", "^", "~",
   "[", "]", and "`".

   All unsafe characters must always be encoded within a URL.


While it was labeled 'unsafe', 'unwise', and now disallowed-by-omission
from RFC3986, the 'must' designation couldn't have been any clearer.
We've had this right for 2 decades at httpd.
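
Expressed as code, the escaping obligation is simple. A minimal sketch,
using the RFC2396 'unwise' set as the example, and assuming the caller
sizes 'out' at 3*strlen(in)+1 bytes; httpd naturally has its own escaping
utilities for this:

    #include <stdio.h>
    #include <string.h>

    /* Escape the RFC2396 'unwise' set so the data can be properly
     * represented within a URI; all other octets copy through as-is. */
    static void escape_unwise(const char *in, char *out)
    {
        static const char unwise[] = "{}|\\^[]`";
        for (; *in; ++in) {
            if (strchr(unwise, *in)) {
                sprintf(out, "%%%02X", (unsigned char)*in);
                out += 3;
            }
            else {
                *out++ = *in;
            }
        }
        *out = '\0';
    }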

Second paragraph of https://tools.ietf.org/html/rfc3986#appendix-D.1
goes into some detail about this change, and while it is hard to parse,
the paragraph is stating that '[' ']' were once invalid, now are reserved,
and remain disallowed in all other path segments and use cases.

The upshot, right now StrictURI will accept '[' and ']', but this won't
survive a rewrite of the apr parser operating with a 'strict' toggle.
StrictURI does not accept '|'. The remaining question is what to do, if
anything, about carving a specific exception here due to modern Firefox
issues.

Thoughts/Comments/Additional test data?  TIA!

Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Sep 13, 2016 3:36 AM, "Yann Ylavic" <yl...@gmail.com> wrote:
>
> On Tue, Sep 13, 2016 at 10:10 AM, Ruediger Pluem <rp...@apache.org>
> wrote:
> >
> >
> > On 09/13/2016 04:19 AM, Eric Covener wrote:
> >> On Mon, Sep 12, 2016 at 5:38 PM, William A Rowe Jr
> >> <wr...@rowe-clan.net> wrote:
> >>> It really seems that if a major client is not handling "|" correctly, we
> >>> need to carve out an exception,
> >>
> >> +1 to allow it.
> >>
> >> For others who might hit a maze of closed/duped bug reports this one
> >> is active this year:
> >> https://bugzilla.mozilla.org/show_bug.cgi?id=1064700
> >>
> >
> > +1
>
> +1, but I wouldn't put the exception in T_URI_RFC3986 (rather where it
> is used).
>
> We could still do things like:
>     ll = uri;
>     /* firefox bug exception w.r.t pipe */
>     while ((ll = ap_scan_http_uri_safe(ll)) && *ll == '|')
>         ll++;
>
> Less simple, but it looks more correct to me: once firefox fixes its
> bug, we don't change T_URI_RFC3986's (expected) behaviour.

No, that's a mess. If an exception exists and isn't honored because that fn
was called from a different module, it would be a nightmare. How we work
around this in a new apr strict parser might look different.

It is documented in the URI set as a (Firefox) bug. The fix is one line;
later we can undo it in a single source file.

It will have to live there for years, long after 2.2 is forgotten and to
the end of 2.next's lifespan. Browsers don't change overnight. RHEL5
Firefox is certainly in use. If Firefox had fixed this before users were
pushed TLS 1.2 updates for next year's SSL, that would be a different story.

The theory was that only user-agent authors are in a position to fix a URI
encoding bug. They ignore error response details. Changing this for
end-users of Firefox is a total hassle.

Thanks for the suggestion though.

Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by Yann Ylavic <yl...@gmail.com>.
On Tue, Sep 13, 2016 at 10:10 AM, Ruediger Pluem <rp...@apache.org> wrote:
>
>
> On 09/13/2016 04:19 AM, Eric Covener wrote:
>> On Mon, Sep 12, 2016 at 5:38 PM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
>>> It really seems that if a major client is not handling "|" correctly, we
>>> need to carve out an exception,
>>
>> +1 to allow it.
>>
>> For others who might hit a maze of closed/duped bug reports this one
>> is active this year:
>> https://bugzilla.mozilla.org/show_bug.cgi?id=1064700
>>
>
> +1

+1, but I wouldn't put the exception in T_URI_RFC3986 (rather where it is used).

We could still do things like:
    ll = uri;
    /* firefox bug exception w.r.t pipe */
    while ((ll = ap_scan_http_uri_safe(ll)) && *ll == '|')
        ll++;

Less simple, but it looks more correct to me: once firefox fixes its
bug, we don't change T_URI_RFC3986's (expected) behaviour.

Regards,
Yann.

Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by Ruediger Pluem <rp...@apache.org>.

On 09/13/2016 04:19 AM, Eric Covener wrote:
> On Mon, Sep 12, 2016 at 5:38 PM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
>> It really seems that if a major client is not handling "|" correctly, we
>> need to carve out an exception,
> 
> +1 to allow it.
> 
> For others who might hit a maze of closed/duped bug reports this one
> is active this year:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1064700
> 

+1

Regards

Rüdiger

Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by "Roy T. Fielding" <fi...@gbiv.com>.
> On Sep 14, 2016, at 6:28 AM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
> 
> On Tue, Sep 13, 2016 at 5:07 PM, Jacob Champion <champion.p@gmail.com <ma...@gmail.com>> wrote:
> On 09/13/2016 12:25 PM, Jacob Champion wrote:
> What is this? Is this the newest "there are a bunch of almost-right
> implementations so let's make yet another standard in the hopes that it
> won't make things worse"? Does anyone know the history behind this spec?
> 
> (My goal in asking this question is not to stare and point and laugh, but more to figure out whether we are skating to where the puck is going. It would be nice for users to know which specification StrictURI is being strict about.)
> 
> RFC3986 as incorporated by and expanded upon by reference in RFC7230. 
> 
> IP, TCP, HTTP and its data and framing are defined by the IETF. HTTP's
> definition depends on the meaning of many things, including ASCII, URI
> syntax, etc.; see its table of citations. The things it depends on simply
> can't be moving targets any more than those definitions that the TCP
> protocol is dependent upon. The IETF process is to correct a broken
> underlying spec with a newly revised spec subject to peer review, and
> then update the consuming specs to leverage the changes in the
> underlying, where necessary (in some cases the revised underlying
> spec, once applied, has no impact on the consuming spec.)
> 
> HTML folks use URLs, and therefore forked the spec they perceived as
> too rigid and inflexible. In fact, it wasn't, but it appears so if you read
> the spec as requiring -users- to -type- valid URIs, which was never the
> case. Although it gets prickly if you consider handling badly authored
> href= links in html. HTML became a "living spec" subject to perpetual
> evolution; this results in a state where all implementations are
> perpetually broken. But the key take-away is that the whatwg spec does
> not and cannot supersede RFC3986 for the purposes of RFC7230. Rather than
> improve the underlying spec, the group decided to overlay an unrelated
> spec.
> 
> https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/ <https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/> does one
> decent job explaining some of this. Google "URI whatwg vs. ietf" for
> an excessively long list of references.
> 
> So in short, the whatwg spec describes URIs anywhere someone wants
> to apply their definition; HTML5 is based upon this. The wire protocol
> of talking to an http: scheme server is defined by RFC7230, which
> subordinates to the RFC3986 definition of a URI. How you choose to
> apply these two specs depends on your position in the stack.

I don't consider the WHATWG to be a standards organization, nor should
anyone else. It is just a selective group (a clique) with opinions about
software that they didn't write and a desire to document it in a way that
excludes the interests of everyone other than browser developers.

The main distinction between the WHATWG "URL standard" (it isn't)  and
the IETF URI standard (it is, encompassing URL and URN) is that HTML5
needs to define the url object in DOM (what is basically an object containing
a parsed URI reference), whereas the IETF needs to define a grammar for
the set of uniform identifiers believed to be interoperable on the Internet.

Obviously, if one spec wants to define everything a user might input as a
reference and call that "URL", while the other wants to define the interoperable
identifier output after uniform parsing of a reference relative to a base URI
as a "URL", the two specs are not going to be compatible.

Do you think the empty string ("") is a URL?  I don't.

A normal author would have used two different terms to define the two
different things (actually, four different things, since the URL spec also uses
url to describe two other things related to URL processing). The IETF chose a
different term, 23 years ago, when it created the term URL instead of just
defining them as "WWW Addresses" or universal document identifiers.

Instead of making a rational effort to document references in HTML, the
WHATWG decided to go on an ego trip about what "real developers" call
a "URL", and then embarked on yet another political effort to reject IETF
standards (that represent the needs of all Internet users, not just
browser developers) in favor of their own "living standards" that only
reflect a figment of the author's imagination (not implementations).

Yes, a user agent will send invalid characters in a request URI.  That is a bug
in the user agent.  Even if every browser chose to do it, that is still a bug in
the browser (not a bug in the spec). The spec knows that those addresses
are unsafe on the command-line and therefore unable to be properly
handled by many parts of the Internet that are not browsers, whereas
the correctly encoded equivalent is known to be interoperable. Hence,
the real standard requires that they be sent in an interoperable form.

Anyway, we have to be careful when testing to note that what a user agent
does with a reference is often dependent on the context in which it receives
the reference.  The requirements of the URI spec are mostly about generation
of an interoperable URI, rather than making a request containing an arbitrary
URI reference. Hence, some browsers will only encode a URI properly when
they have control over the generation process, leaving the responsibility for
proper encoding of other references to the authors creating those links.
Thus, a user agent might encode the request URI differently if the reference
is received in an href than it would when the same string is typed in the
address dialog, constructed via javascript, or stored within a bookmark.
Likewise, some user agents (like curl and wget) will send invalid characters
in a request URI because they are deliberately chosen for pen testing.

RFCs never limit what a component can send, since conformance is voluntary.
What they limit is the range of chaos that is considered interoperable, with
an expectation that a normal sender will want to conform, for its own sake,
and a normal recipient can feel free to ignore or error on non-conformance.

....Roy


Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Sep 14, 2016 12:59 PM, "Ruediger Pluem" <rp...@apache.org> wrote:
>
> On 09/14/2016 07:17 PM, Jacob Champion wrote:
> >
> > I think that's bad from a documentation and usability standpoint. If
> > WHATWG (hypothetically) decided to bless more exceptions to the RFC,
> > would we follow suit with StrictURI? Is StrictURI *really* an option to
> > follow the RFCs to the letter, or is it an option to try to do things as
> > correctly as we can without breaking major browsers?
>
> I think it should be the latter in this case. I see no real use case
> for an option that makes it fail with major browsers.

I've reached the same conclusion and will not pursue the Firefox exception,
so that everyone can use these results for reference.

StrictURI can be used by app developers, UA developers, content authors etc
to verify their conformance.

Note that in the URI path segment, none of these wonky exceptions actually
matters, so long as content authors, app devs and admins create valid URIs
with proper encoding. E.g. StrictURI is perfectly usable for most path
resources of most conventional sites. Even if resources have these unsafe
chars, if they correctly encode their own href= it will work fine.

It is the complete failure in all browsers to correctly submit query args
that makes this all unusable. If and when browsers correct their encoding,
this will become usable in production.

We can clearly document this in the directive docs and leave StrictURI
not-recommended for general deployments.

When we add a strict mode in apr 1.next, we can validate per-segment, and
even exclude the query args from strict behavior, if desired. But that's
beyond the scope of what I propose in the short term.
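
For the curious, here is a rough sketch of what that per-segment pass
might look like. Everything below is hypothetical, none of it is existing
apr or httpd API; valid_pchar would be fed by a table like the one
gen_test_char emits:

    #include <ctype.h>

    /* Walk the path segment by segment, stopping at '?' so the query
     * args can be held to a looser standard if desired. */
    static int strict_path_ok(const char *uri, int (*valid_pchar)(int))
    {
        const char *p = uri;
        while (*p && *p != '?') {
            if (*p == '/') {                  /* segment delimiter */
                ++p;
            }
            else if (*p == '%') {             /* pct-encoded triplet */
                if (!isxdigit((unsigned char)p[1])
                    || !isxdigit((unsigned char)p[2]))
                    return 0;
                p += 3;
            }
            else if (valid_pchar((unsigned char)*p)) {
                ++p;
            }
            else {
                return 0;                     /* raw invalid octet */
            }
        }
        return 1;
    }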

Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by Ruediger Pluem <rp...@apache.org>.

On 09/14/2016 07:17 PM, Jacob Champion wrote:
> On 09/14/2016 06:28 AM, William A Rowe Jr wrote:
>> On Tue, Sep 13, 2016 at 5:07 PM, Jacob Champion <champion.p@gmail.com
>> <ma...@gmail.com>> wrote:
>>
>>     (My goal in asking this question is not to stare and point and
>>     laugh, but more to figure out whether we are skating to where the
>>     puck is going. It would be nice for users to know which
>>     specification StrictURI is being strict about.)
>>
>> RFC3986 as incorporated by and expanded upon by reference in RFC7230.
> 
> ...plus at least (if I'm understanding correctly) three exceptions ('|', '[', ']') because of what we consider to be
> bugs in popular browsers.
> 
> FWIW, I am +1 to those exceptions because I think it's the pragmatic thing to do. But based on the linked Mozilla bug
> thread, if they have decided to forsake the IETF RFCs and are instead following a separate "specification" that has a
> habit of simply tracking things as they are, there's a decent chance that those bugs will not be fixed. In which case
> StrictURI will never be "strict".
> 
> I think that's bad from a documentation and usability standpoint. If WHATWG (hypothetically) decided to bless more
> exceptions to the RFC, would we follow suit with StrictURI? Is StrictURI *really* an option to follow the RFCs to the
> letter, or is it an option to try to do things as correctly as we can without breaking major browsers?

I think it should be the latter in this case. I see no real use case for
an option that makes it fail with major browsers.

Regards

Rüdiger

Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by Jacob Champion <ch...@gmail.com>.
On 09/14/2016 06:28 AM, William A Rowe Jr wrote:
> On Tue, Sep 13, 2016 at 5:07 PM, Jacob Champion <champion.p@gmail.com
> <ma...@gmail.com>> wrote:
>
>     (My goal in asking this question is not to stare and point and
>     laugh, but more to figure out whether we are skating to where the
>     puck is going. It would be nice for users to know which
>     specification StrictURI is being strict about.)
>
> RFC3986 as incorporated by and expanded upon by reference in RFC7230.

...plus at least (if I'm understanding correctly) three exceptions ('|', 
'[', ']') because of what we consider to be bugs in popular browsers.

FWIW, I am +1 to those exceptions because I think it's the pragmatic 
thing to do. But based on the linked Mozilla bug thread, if they have 
decided to forsake the IETF RFCs and are instead following a separate 
"specification" that has a habit of simply tracking things as they are, 
there's a decent chance that those bugs will not be fixed. In which case 
StrictURI will never be "strict".

I think that's bad from a documentation and usability standpoint. If 
WHATWG (hypothetically) decided to bless more exceptions to the RFC, 
would we follow suit with StrictURI? Is StrictURI *really* an option to 
follow the RFCs to the letter, or is it an option to try to do things as 
correctly as we can without breaking major browsers?

(In any case, to echo Rüdiger: thanks very much for your research into 
this.)

--Jacob

Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Tue, Sep 13, 2016 at 5:07 PM, Jacob Champion <ch...@gmail.com>
wrote:

> On 09/13/2016 12:25 PM, Jacob Champion wrote:
>
>> What is this? Is this the newest "there are a bunch of almost-right
>> implementations so let's make yet another standard in the hopes that it
>> won't make things worse"? Does anyone know the history behind this spec?
>>
>
> (My goal in asking this question is not to stare and point and laugh, but
> more to figure out whether we are skating to where the puck is going. It
> would be nice for users to know which specification StrictURI is being
> strict about.)


RFC3986 as incorporated by and expanded upon by reference in RFC7230.

IP, TCP, HTTP and its data and framing are defined by the IETF. HTTP's
definition depends on the meaning of many things, including ASCII, URI
syntax, etc.; see its table of citations. The things it depends on simply
can't be moving targets any more than those definitions that the TCP
protocol is dependent upon. The IETF process is to correct a broken
underlying spec with a newly revised spec subject to peer review, and
then update the consuming specs to leverage the changes in the
underlying, where necessary (in some cases the revised underlying
spec, once applied, has no impact on the consuming spec.)

HTML folks use URLs, and therefore forked the spec they perceived as
too rigid and inflexible. In fact, it wasn't, but it appears so if you
read the spec as requiring -users- to -type- valid URIs, which was never
the case. Although it gets prickly if you consider handling badly
authored href= links in html. HTML became a "living spec" subject to
perpetual evolution; this results in a state where all implementations
are perpetually broken. But the key take-away is that the whatwg spec
does not and cannot supersede RFC3986 for the purposes of RFC7230.
Rather than improve the underlying spec, the group decided to overlay
an unrelated spec.

https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/ does one
decent job explaining some of this. Google "URI whatwg vs. ietf" for
an excessively long list of references.

So in short, the whatwg spec describes URIs anywhere someone wants
to apply their definition; HTML5 is based upon this. The wire protocol
of talking to an http: scheme server is defined by RFC7230, which
subordinates to the RFC3986 definition of a URI. How you choose to
apply these two specs depends on your position in the stack.

Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by Jacob Champion <ch...@gmail.com>.
On 09/13/2016 12:25 PM, Jacob Champion wrote:
> What is this? Is this the newest "there are a bunch of almost-right
> implementations so let's make yet another standard in the hopes that it
> won't make things worse"? Does anyone know the history behind this spec?

(My goal in asking this question is not to stare and point and laugh, 
but more to figure out whether we are skating to where the puck is 
going. It would be nice for users to know which specification StrictURI 
is being strict about.)

--Jacob

Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by Jacob Champion <ch...@gmail.com>.
On 09/13/2016 08:55 AM, William A Rowe Jr wrote:
> On Mon, Sep 12, 2016 at 9:19 PM, Eric Covener <covener@gmail.com
> <ma...@gmail.com>> wrote:
>
>     For others who might hit a maze of closed/duped bug reports this one
>     is active this year:
>     https://bugzilla.mozilla.org/show_bug.cgi?id=1064700
>     <https://bugzilla.mozilla.org/show_bug.cgi?id=1064700>
>
>
> Makes for some disturbing reading... the amount of misinformation
> is truly mind-boggling (especially if you chase down the other reports.)
> Their aspirational goal of duplicating the mistakes of other the clients
> speaks for the wider UA community... sigh. Firefox since 'uncorrected'
> their originally correct handling of '[' and ']' to be equally out-of-spec.

One of the comments there points to

     https://url.spec.whatwg.org/

What is this? Is this the newest "there are a bunch of almost-right 
implementations so let's make yet another standard in the hopes that it 
won't make things worse"? Does anyone know the history behind this spec?

--Jacob

Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by Ruediger Pluem <rp...@apache.org>.

On 09/13/2016 08:02 PM, William A Rowe Jr wrote:
> On Tue, Sep 13, 2016 at 10:55 AM, William A Rowe Jr <wrowe@rowe-clan.net <ma...@rowe-clan.net>> wrote:
> 
>     On Mon, Sep 12, 2016 at 9:19 PM, Eric Covener <covener@gmail.com <ma...@gmail.com>> wrote:
> 
> 
>         For others who might hit a maze of closed/duped bug reports this one
>         is active this year:
>         https://bugzilla.mozilla.org/show_bug.cgi?id=1064700 <https://bugzilla.mozilla.org/show_bug.cgi?id=1064700>
> 
> 
>     Makes for some disturbing reading... the amount of misinformation
>     is truly mind-boggling (especially if you chase down the other reports.)
>     Their aspirational goal of duplicating the mistakes of the other clients
>     speaks for the wider UA community... sigh. Firefox has since 'uncorrected'
>     their originally correct handling of '[' and ']' to be equally out-of-spec.
> 
>     But it leads to a very thorough survey of the queryargs behavior of the
>     major browser families which is worth reviewing;
>     https://bugzilla.mozilla.org/show_bug.cgi?id=1152455#c6 <https://bugzilla.mozilla.org/show_bug.cgi?id=1152455#c6>
> 
> 
> unwise/unsafe aside, review the rest of that comment 6 survey.
> But a short synopsis...
> 
> IE fails to encode any byte 7F-FF in the query args (particularly noxious
> with DEL). Retested and this remains true of IE 12 on Windows 10.
> So UTF-8 query arg text is transmitted in raw bytes on IE in violation
> of RFC3986, while all other browsers encode these.
> 
> All browsers use U+FFFD to map the value NUL.  With respect to other
> discussions about ctrl chars, things get interesting. TAB/LF/CR are
> simply eaten and not sent to the server, while other CTRLs in all IE
> query args are considered invalid, and the browser refuses to
> transmit these. Trailing CTRLs on all browsers are simply discarded.
> 
> Given Microsoft's lead here in ignoring or refusing all CTRLs for query
> args (except DEL, which they mishandle anyway), it is starting to look
> especially safe to reject all %XX control chars when operating in the
> StrictURI mode (and as a non-default in 2.2/2.4).  Thoughts?

Sounds sensible. Thanks for your hard work on that topic, Bill.

Regards

Rüdiger

Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Tue, Sep 13, 2016 at 10:55 AM, William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> On Mon, Sep 12, 2016 at 9:19 PM, Eric Covener <co...@gmail.com> wrote:
>
>>
>> For others who might hit a maze of closed/duped bug reports this one
>> is active this year:
>> https://bugzilla.mozilla.org/show_bug.cgi?id=1064700
>>
>
> Makes for some disturbing reading... the amount of misinformation
> is truly mind-boggling (especially if you chase down the other reports.)
> Their aspirational goal of duplicating the mistakes of the other clients
> speaks for the wider UA community... sigh. Firefox has since 'uncorrected'
> their originally correct handling of '[' and ']' to be equally out-of-spec.
>
> But it leads to a very thorough survey of the queryargs behavior of the
> major browser families which is worth reviewing;
> https://bugzilla.mozilla.org/show_bug.cgi?id=1152455#c6
>

unwise/unsafe aside, review the rest of that comment 6 survey.
But a short synopsis...

IE fails to encode any byte 7F-FF in the query args (particularly noxious
with DEL). Retested and this remains true of IE 12 on Windows 10.
So UTF-8 query arg text is transmitted in raw bytes on IE in violation
of RFC3986, while all other browsers encode these.

All browsers use U+FFFD to map the value NUL.  With respect to other
discussions about ctrl chars, things get interesting. TAB/LF/CR are
simply eaten and not sent to the server, while other CTRLs in all IE
query args are considered invalid, and the browser refuses to
transmit these. Trailing CTRLs on all browsers are simply discarded.

Given Microsoft's lead here in ignoring or refusing all CTRLs for query
args (except DEL, which they mishandle anyway), it is starting to look
especially safe to reject all %XX control chars when operating in the
StrictURI mode (and as a non-default in 2.2/2.4).  Thoughts?
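
For illustration, a minimal sketch of such a check; the helper names are
invented here, and a real implementation would hang off the StrictURI
option rather than run unconditionally:

    static int hexval(int c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    /* Return nonzero if the request URI carries any %XX-encoded
     * control character (%00-%1F, or %7F for DEL). */
    static int has_encoded_ctrl(const char *uri)
    {
        for (; *uri; ++uri) {
            if (uri[0] == '%'
                && hexval(uri[1]) >= 0 && hexval(uri[2]) >= 0) {
                int byte = hexval(uri[1]) * 16 + hexval(uri[2]);
                if (byte < 0x20 || byte == 0x7F)
                    return 1;
            }
        }
        return 0;
    }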

Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Mon, Sep 12, 2016 at 9:19 PM, Eric Covener <co...@gmail.com> wrote:

> On Mon, Sep 12, 2016 at 5:38 PM, William A Rowe Jr <wr...@rowe-clan.net>
> wrote:
> > It really seems that if a major client is not handling "|" correctly, we
> > need to carve out an exception,
>
> +1 to allow it.
>
> For others who might hit a maze of closed/duped bug reports this one
> is active this year:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1064700
>

Makes for some disturbing reading... the amount of misinformation
is truly mind-boggling (especially if you chase down the other reports.)
Their aspirational goal of duplicating the mistakes of the other clients
speaks for the wider UA community... sigh. Firefox has since 'uncorrected'
their originally correct handling of '[' and ']' to be equally out-of-spec.

But it leads to a very thorough survey of the queryargs behavior of the
major browser families which is worth reviewing;
https://bugzilla.mozilla.org/show_bug.cgi?id=1152455#c6

Based on the complete mess which is queryarg behavior from all of
the browser families (my interrogations didn't cover this)... it appears
that we cannot reject any of the 'unwise'/'unsafe' set without causing
major headaches:

   hex                22  3C  3E  5B  5C  5D  5E  60  7B  7C  7D
   char               "   <   >   [   \   ]   ^   `   {   |   }

   Chrome   path      %   %   %   .   .   .   %   %   %   %   %
   +Opera   query     %   %   %   .   .   .   .   .   .   .   .

   IE       path      %   %   %   .   .   .   %   %   %   %   %
            query     .   .   .   .   .   .   .   .   .   .   .

   Firefox  path      %   %   %   %   %   %   %   %   %   .   %
            query     %   %   %   .   .   .   .   %   .   .   .

   Safari   path      %   %   %   .   .   .   .   .   .   .   .
            query     %   %   %   .   .   .   .   .   .   .   .

So I will add the entire unwise (and unmentioned, in RFC3986) set
to our URI validator. I don't particularly want to create some middle
tier 'mostly safe but unwise chars accepted' configuration option.

Internally httpd will reassemble these in the path segment, correctly
encoded per spec for Location: and back-end URIs. Because httpd
often does not decode/encode the user-provided queryargs, it will
generally pass these back or along as submitted by the client.
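
As a sketch of that asymmetry (purely illustrative; the allowed set below
approximates the T_URI_RFC3986 test inline rather than using httpd's
generated table, and the names are made up):

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    static int uri_char_ok(int c)   /* approximates T_URI_RFC3986 */
    {
        return c && (isalnum(c)
                     || strchr("%:/?#[]@!$&'()*+,;=-._~", c) != NULL);
    }

    /* Copy a request URI, re-encoding invalid path octets per spec
     * while passing the query args through exactly as the client sent
     * them. 'out' is assumed to hold 3*strlen(uri)+1 bytes. */
    static void reassemble_uri(const char *uri, char *out)
    {
        int in_query = 0;
        for (; *uri; ++uri) {
            if (*uri == '?')
                in_query = 1;
            if (in_query || uri_char_ok((unsigned char)*uri)) {
                *out++ = *uri;
            }
            else {
                sprintf(out, "%%%02X", (unsigned char)*uri);
                out += 3;
            }
        }
        *out = '\0';
    }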

Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by Eric Covener <co...@gmail.com>.
On Mon, Sep 12, 2016 at 5:38 PM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
> It really seems that if a major client is not handling "|" correctly, we
> need to carve out an exception,

+1 to allow it.

For others who might hit a maze of closed/duped bug reports this one
is active this year:
https://bugzilla.mozilla.org/show_bug.cgi?id=1064700

Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Mon, Sep 12, 2016 at 3:06 PM, William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> On Mon, Sep 12, 2016 at 10:49 AM, William A Rowe Jr <wr...@rowe-clan.net>
> wrote:
>
>> On Mon, Aug 29, 2016 at 1:04 PM, Ruediger Pluem <rp...@apache.org>
>> wrote:
>>
>>>
>>> On 08/29/2016 06:25 PM, William A Rowe Jr wrote:
>>> > Thanks all for the feedback. Status and follow-up questions inline
>>> >
>>> > On Thu, Aug 25, 2016 at 10:02 PM, William A Rowe Jr <
>>> wrowe@rowe-clan.net <ma...@rowe-clan.net>> wrote:
>>> >
>>> >     4. Should the next 2.4/2.2 releases default to Strict[URI] at all?
>>> >
>>> >     Real world direct observation especially appreciated from actual
>>> deployments.
>>> >
>>> > Strict (and StrictURI) remain the default.
>>>
>>> StrictURI as a default only makes sense if we have our own house in
>>> order (see above), otherwise it should be opt in.
>>
>>
>> So it's not only our house [our %3B encoding in httpd isn't a showstopper
>> here]... but also whether widely used user-agent browsers and tooling
>> have
>> their houses in order, so I started to study the current browser
>> behaviors.
>> The applicable spec is https://tools.ietf.org/html/rfc3986#section-3.3
>>
>
>> The character '|' is also invalid. However, Firefox fails to follow the
>> spec again here (although Chrome gets it right).
>>
>> With respect to these characters, recall this 18 year old document,
>> last paragraph describes the rationale;
>> https://tools.ietf.org/html/rfc2396.html#section-2.4.3
>>
>>    unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
>>
>>    Data corresponding to excluded characters must be escaped in order to
>>    be properly represented within a URI.
>>
>>
>> While it was labeled 'unsafe', 'unwise', and now disallowed-by-omission
>> from RFC3986, the 'must' designation couldn't have been any clearer.
>> We've had this right for 2 decades at httpd.
>>
>> Second paragraph of https://tools.ietf.org/html/rfc3986#appendix-D.1
>> goes into some detail about this change, and while it is hard to parse,
>> the paragraph is stating that '[' ']' were once invalid, now are reserved,
>> and remain disallowed in all other path segments and use cases.
>>
>> The upshot, right now StrictURI will accept '[' and ']', but this won't survive
>> a rewrite of the apr parser operating with a 'strict' toggle. StrictURI does
>> not accept '|'. The remaining question is what to do, if anything, about
>> carving a specific exception here due to modern Firefox issues.
>>
>> Thoughts/Comments/Additional test data?  TIA!
>>
>>
It really seems that if a major client is not handling "|" correctly, we
need to carve out an exception, as well as disallow the "#" fragment
gen-delim, which is not allowed to be presented. E.g.;

--- server/gen_test_char.c (revision 1760444)
+++ server/gen_test_char.c (working copy)
@@ -143,10 +143,11 @@
          * and unreserved (2.3) that are possible somewhere within a URI.
          * Spec requires all others to be %XX encoded, including obs-text.
          */
-        if (c && (strchr("%"                              /* pct-encode */
-                         ":/?#[]@"                        /* gen-delims */
-                         "!$&'()*+,;="                    /* sub-delims */
-                         "-._~", c) || apr_isalnum(c))) { /* unreserved */
+        if (c && (strchr("%"                           /* pct-encode */
+                         ":/?[]@"                      /* gen-delims - "#" */
+                         "!$&'()*+,;="                 /* sub-delims */
+                         "-._~"                        /* unreserved */
+                         "|", c) || apr_isalnum(c))) { /* permit firefox bug */
             flags |= T_URI_RFC3986;
         }


So my only remaining question is: what of the others in the not-mentioned,
entirely invalid set? <"> | "<" | ">" | "\" | "^" | "`" | "{" | "}" ... So
far the modern browsers reviewed handle these correctly, but if anyone has
old browsers still installed for testing/validation, double-checking the
test queries would still be a big help, as well as confirming on Safari,
Dolphin, etc.

Are we ok with adding one invalid exception for firefox to StrictURI (and
later, two more, "[" "]", when we code segment-by-segment validation into
apr) while still disallowing the rest of this list?

Re: StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Mon, Sep 12, 2016 at 10:49 AM, William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> On Mon, Aug 29, 2016 at 1:04 PM, Ruediger Pluem <rp...@apache.org> wrote:
>
>>
>> On 08/29/2016 06:25 PM, William A Rowe Jr wrote:
>> > Thanks all for the feedback. Status and follow-up questions inline
>> >
>> > On Thu, Aug 25, 2016 at 10:02 PM, William A Rowe Jr <
>> wrowe@rowe-clan.net <ma...@rowe-clan.net>> wrote:
>> >
>> >     4. Should the next 2.4/2.2 releases default to Strict[URI] at all?
>> >
>> >     Real world direct observation especially appreciated from actual
>> deployments.
>> >
>> > Strict (and StrictURI) remain the default.
>>
>> StrictURI as a default only makes sense if we have our own house in order
>> (see above), otherwise it should be opt in.
>
>
> So it's not only our house [our %3B encoding in httpd isn't a showstopper
> here]... but also whether widely used user-agent browsers and tooling have
> their houses in order, so I started to study the current browser
> behaviors.
> The applicable spec is https://tools.ietf.org/html/rfc3986#section-3.3
>

The second test below has been updated with 2 and 3 byte utf-8 sequences,
and see no new surprises showed up.

Checked the unreserved set with '?' and '/' observing special meanings.
Nothing here should become escaped when given as a URI;
http://localhost:8080/unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query

Checked the invalid set of characters, all of which must be encoded
per the spec, and verified that #frag is not passed to the server;
http://localhost:8080/gen-delims-[]/invalid- "<>\^`{|}§‡#frag

Checked the reserved set including '#' '%' '?' by their encoded value
to determine if there are any unpleasant reverse surprises lurking;
http://localhost:8080/encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D

Checked a list of unreserved/unassigned gen-delims and sub-delims
to determine if the user agent normalizes while composing the request;
http://localhost:8080/plain-%21%24%26%27%28%29%2A%2B%2C%2D%2E%31%32%33%41%42%43%5F%61%62%63%7E

Using the simplistic $ nc -kl localhost 8080, here are the results I
obtained from a couple of current browsers. More observations and feedback
on other user-agents to this list would be appreciated.

Chrome 53:
GET /unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query HTTP/1.1
GET /gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B%7C%7D%C2%A7%E2%80%A1 HTTP/1.1
odd>            ^^                     ^
GET /encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D HTTP/1.1
GET /plain-%21%24%26%27%28%29%2A%2B%2C-.123ABC_abc~ HTTP/1.1
odd>        ^  ^  ^  ^  ^  ^  ^  ^  ^

Firefox 48:
GET /unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query HTTP/1.1
GET /gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B|%7D HTTP/1.1
odd>            ^^                     ^         ^
GET /encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D%C2%A7%E2%80%A1 HTTP/1.1
GET /plain-%21%24%26%27%28%29%2A%2B%2C%2D%2E%31%32%33%41%42%43%5F%61%62%63%7E HTTP/1.1
odd>        ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^

IE 11:
GET /unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query HTTP/1.1
GET /gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B%7C%7D%C2%A7%E2%80%A1 HTTP/1.1
odd>            ^^                     ^
GET /encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D HTTP/1.1
GET /plain-%21%24%26%27%28%29%2A%2B%2C-.123ABC_abc~ HTTP/1.1
odd>        ^  ^  ^  ^  ^  ^  ^  ^  ^



> The character '\' is converted to a '/' by both browsers, in a nod either
> to Microsoft insanity, or a less-accessible '/' key. (Which suggests that
> the yen sign might be treated similarly in some jp locales.) Invalid as a
> literal '\' character, both browsers support an explicit %5C for those who
> really want to use that in a URI. No actual issue here.
>

Ditto for Microsoft IE.


> Interestingly, gen-delims '@' and ':' are explicitly allowed by 3.3 grammar
> (as I've tested above), while '[' and ']' are omitted and therefore not
> allowed according to spec. (On this, StrictURI won't care yet, because we
> are simply correcting for any valid URI character, not by section, and
> '[' ']' are obviously allowed for the IPv6 port specification - so we
> don't reject yet.) When we add strict parsing to the apr uri parsing
> function, we will trip over this, from all browsers, in spite of these
> being prohibited and declared unwise for the past 18 years or more.
>

IE also suffers the '[' ']' defect above, and does not share the
Firefox-specific defect of '|' below. In short, Chrome and IE behavior
appears to be identical over the wire.


> The character '|' is also invalid. However, Firefox fails to follow the
> spec again here (although Chrome gets it right).
>
> With respect to these characters, recall this 18 year old document,
> last paragraph describes the rationale;
> https://tools.ietf.org/html/rfc2396.html#section-2.4.3
>
>    unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
>
>    Data corresponding to excluded characters must be escaped in order to
>    be properly represented within a URI.
>
>
> Which replaced https://tools.ietf.org/html/rfc1738#section-2.2 now
> almost 22 years old, without changing the rules;
>
>    Unsafe:
>
>    Characters can be unsafe for a number of reasons.  The space
>    character is unsafe because significant spaces may disappear and
>    insignificant spaces may be introduced when URLs are transcribed or
>    typeset or subjected to the treatment of word-processing programs.
>    The characters "<" and ">" are unsafe because they are used as the
>    delimiters around URLs in free text; the quote mark (""") is used to
>    delimit URLs in some systems.  The character "#" is unsafe and should
>    always be encoded because it is used in World Wide Web and in other
>    systems to delimit a URL from a fragment/anchor identifier that might
>    follow it.  The character "%" is unsafe because it is used for
>    encodings of other characters.  Other characters are unsafe because
>    gateways and other transport agents are known to sometimes modify
>    such characters. These characters are "{", "}", "|", "\", "^", "~",
>    "[", "]", and "`".
>
>    All unsafe characters must always be encoded within a URL.
>
>
> While it was labeled 'unsafe', 'unwise', and now disallowed-by-omission
> from RFC3986, the 'must' designation couldn't have been any clearer.
> We've had this right for 2 decades at httpd.
>
> Second paragraph of https://tools.ietf.org/html/rfc3986#appendix-D.1
> goes into some detail about this change, and while it is hard to parse,
> the paragraph is stating that '[' ']' were once invalid, now are reserved,
> and remain disallowed in all other path segments and use cases.
>
> The upshot, right now StrictURI will accept '[' and ']', but this won't
> survive a rewrite of the apr parser operating with a 'strict' toggle.
> StrictURI does not accept '|'. The remaining question is what to do, if
> anything, about carving a specific exception here due to modern Firefox
> issues.
>
> Thoughts/Comments/Additional test data?  TIA!
>
>