Posted to dev@hc.apache.org by Oleg Kalnichevski <ol...@apache.org> on 2020/07/10 09:04:21 UTC

RFC3986 conformance saga

Folks

Some of you might have followed the RFC3986 Soap Opera in JIRA (with a
very special guest moreover), with interest, or not.

Anyway, here's what I can propose as an immediate step. I will invest
time into making _our_ code and protocol logic conformant to RFC3986
(Roy's interpretation of it, that is) and put it into 5.1.x branch for
review.

We will end up with an RFC3986 / RFC2396 hybrid, at least initially. At
a later point someone else would be very welcome to work on
HTTPCORE-637

https://issues.apache.org/jira/browse/HTTPCORE-637

Oleg



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
For additional commands, e-mail: dev-help@hc.apache.org

Re: RFC3986 conformance saga

Posted by Michael Osipov <mi...@apache.org>.

On 2020-07-26 at 23:17, Mark Mielke wrote:
> On Sun, Jul 26, 2020 at 1:29 PM Oleg Kalnichevski <ol...@apache.org> wrote:
> 
>> Please find a few minutes to review the changes I am proposing to make
>> HttpCore partially RFC3986 conformant in the 5.1 branch:
>>
>> https://github.com/apache/httpcomponents-core/pull/205
>>
>> There are two major differences to the behavior of HttpCore 5.0.x and
>> HttpClient 4.5.x:
>>
>> 1. percent-encoding is applied to all unreserved characters whether or
>> not some of those characters are explicitly permitted for use in URI
>> components, for instance `&` character in path segments.
>>
> 
> Did you mean "is applied to all *reserved* characters whether or not some
> of these characters are explicitly permitted for use in URI components"?

I think this is a typo. Look at the code: all unreserved chars are
passed as-is.

> I am trying to understand the impact. From the test cases, as well as the
> above statement, I'm led to the conclusion that you are changing the "URI
> producing" code, which will have the side effect of changing the
> normalization code. Is this a correct assessment? If so, I'm wondering
> whether this leads to the opposite problem.
> 
> RFC 3986 doesn't say that one should always percent-encode every reserved
> character. As I think you pointed out, it says:
> 
>     URI producing applications should percent-encode data octets that
>     correspond to characters in the reserved set *unless these characters
>     are specifically allowed by the URI scheme to represent data in that
>     component.*  If a reserved character is found in a URI component and
>     no delimiting role is known for that character, then it must be
>     interpreted as representing the data octet corresponding to that
>     character's encoding in US-ASCII.
> 
> 
> This "should ... unless ... specifically allowed by ..." means that they
> don't need to be percent-encoded, and perhaps they shouldn't automatically
> be encoded. "Should" is soft, in that URI producing either way should be
> legal, but I think your original perspective that they "should not" is
> correct as the default expectation.
> 
> I think it is useful to have URI producing classes that can specify whether
> something is a "path component" or a "path", and in the case of a "path
> component", it would automatically percent-encode the "/" reserved
> character and such. If my read of the change is that this capability is
> being introduced, then I do like it. Similarly, I think there should be a
> way to add components without normalization or percent-encoding, although
> for the most part people use StringBuilder or similar to achieve this end
> today.

RFC 7230 does not deviate from RFC 3986: 
https://github.com/apache/httpcomponents-core/pull/205#issuecomment-665557772
There is no special handling for HTTP, as far as I understand.

> The original concern for me wasn't about URI producing. The concern was
> that URI normalization, which was being applied by default, was changing
> the URI in a way that was not necessary, and that was not considered
> "normalization" per RFC 3986 and the expectations of at least some people
> and implementations:
> 
>     The purpose of reserved characters is to provide a set of delimiting
>     characters that are distinguishable from other data within a URI.
>     URIs that differ in the replacement of a reserved character with its
>     corresponding percent-encoded octet are not equivalent.  Percent-
>     encoding a reserved character, or decoding a percent-encoded octet
>     that corresponds to a reserved character, will change how the URI is
>     interpreted by most applications.  *Thus, characters in the reserved
>     set are protected from normalization and are therefore safe to be
>     used by scheme-specific and producer-specific algorithms for
>     delimiting data subcomponents within a URI.*
> 
> 
> Whether you or I might argue that they "should" or "should not" be
> equivalent from a URI producing perspective, there is some expectation that
> reasonable people and existing implementations which we need to
> inter-operate with might disagree, and the formal documentation is now
> explicit in RFC 3986 that the characters in the reserved set are protected
> from normalization and therefore safe to use by scheme-specific and
> producer-specific algorithms for delimiting data subcomponents within a
> URI. This isn't saying it "should" or "shouldn't" be done. It's saying that
> it is recognized that it is being done, so any method of normalization
> needs to be cautious.
> 
> My expectation is that an HTTP client that receives a URI from a
> redirect or other external source should not automatically apply
> unnecessary normalization that might change the meaning of the URI to the
> producer. It means acknowledging that the producer is authoritative for
> whether or not the reserved characters should be encoded for their use
> case, and it is not permitted for Apache HTTP Client to transform the URI
> to mean something different in this case.

I don't see any normalization code. It just encodes everything which is 
not safe.

> For example, according to RFC 3986 I would expect:
> 
>      http://acme.com/foo&bar     to be normalized to:
>      http://acme.com/foo&bar
> 
> And:
> 
>      http://acme.com/foo%26bar     to be normalized to:
>      http://acme.com/foo%26bar
> 
> Does the proposed code result in this expectation being met? Or does it
> always percent-encode, leading to the opposite problem, that normalization
> is now potentially breaking applications that presume the '&' will be left
> intact in the path segment?

Why don't you try?

>> I still find RFC 3986 terribly inconsistent and confusing but I suppose
>> I am just not smart enough for it.
>>
> 
> I wonder if URI producing and URI normalization are being conflated, and
> this is the crucial point to resolving the confusion from both perspectives?

I think the code complies with
https://tools.ietf.org/html/rfc7230#section-2.7.3

It will not encode "/mama/" to "/%6d%61%6d%61/".
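
A minimal sketch of the producer-side behaviour described here, written
in Python rather than the project's Java. It assumes the encoder
percent-encodes every octet outside the RFC 3986 unreserved set and
passes unreserved characters through untouched. The name
encode_component is hypothetical and is not the HttpCore API.

```python
# Unreserved characters per RFC 3986, section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def encode_component(text: str) -> str:
    """Percent-encode every UTF-8 octet that is not an unreserved character.

    Hypothetical helper for illustration only.
    """
    out = []
    for octet in text.encode("utf-8"):
        ch = chr(octet)
        # Unreserved characters are emitted as-is; everything else,
        # including reserved characters such as '&' and '/', is encoded.
        out.append(ch if ch in UNRESERVED else "%{:02X}".format(octet))
    return "".join(out)
```

Under these assumptions "mama" comes back unchanged, while "foo&bar"
becomes "foo%26bar".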

M



Re: RFC3986 conformance saga

Posted by Mark Mielke <ma...@gmail.com>.

On Sun, Jul 26, 2020 at 1:29 PM Oleg Kalnichevski <ol...@apache.org> wrote:

> Please find a few minutes to review the changes I am proposing to make
> HttpCore partially RFC3986 conformant in the 5.1 branch:
>
> https://github.com/apache/httpcomponents-core/pull/205
>
> There are two major differences to the behavior of HttpCore 5.0.x and
> HttpClient 4.5.x:
>
> 1. percent-encoding is applied to all unreserved characters whether or
> not some of those characters are explicitly permitted for use in URI
> components, for instance `&` character in path segments.
>

Did you mean "is applied to all *reserved* characters whether or not some
of these characters are explicitly permitted for use in URI components"?

I am trying to understand the impact. From the test cases, as well as the
above statement, I'm led to the conclusion that you are changing the "URI
producing" code, which will have the side effect of changing the
normalization code. Is this a correct assessment? If so, I'm wondering
whether this leads to the opposite problem.

RFC 3986 doesn't say that one should always percent-encode every reserved
character. As I think you pointed out, it says:

   URI producing applications should percent-encode data octets that
   correspond to characters in the reserved set *unless these characters
   are specifically allowed by the URI scheme to represent data in that
   component.*  If a reserved character is found in a URI component and
   no delimiting role is known for that character, then it must be
   interpreted as representing the data octet corresponding to that
   character's encoding in US-ASCII.

This "should ... unless ... specifically allowed by ..." means that they
don't need to be percent-encoded, and perhaps they shouldn't automatically
be encoded. "Should" is soft, in that URI producing either way should be
legal, but I think your original perspective that they "should not" is
correct as the default expectation.

I think it is useful to have URI producing classes that can specify whether
something is a "path component" or a "path", and in the case of a "path
component", it would automatically percent-encode the "/" reserved
character and such. If my read of the change is that this capability is
being introduced, then I do like it. Similarly, I think there should be a
way to add components without normalization or percent-encoding, although
for the most part people use StringBuilder or similar to achieve this end
today.
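
The "path component" versus "path" distinction can be sketched with
Python's standard urllib.parse.quote, whose safe parameter lists the
characters to leave unencoded. The two function names below are
hypothetical and only illustrate the idea; they do not correspond to any
HttpCore class.

```python
from urllib.parse import quote


def encode_path_segment(segment: str) -> str:
    """Encode one path segment: '/' is data here, so it is percent-encoded."""
    return quote(segment, safe="")


def encode_path(path: str) -> str:
    """Encode a whole path: '/' is a delimiter here, so it is left intact."""
    return quote(path, safe="/")
```

With these helpers, encode_path_segment("a/b") yields "a%2Fb", whereas
encode_path("/a/b c") yields "/a/b%20c". Note that both encode '&', which
is the always-encode behaviour questioned in this thread.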

The original concern for me wasn't about URI producing. The concern was
that URI normalization, which was being applied by default, was changing
the URI in a way that was not necessary, and that was not considered
"normalization" per RFC 3986 and the expectations of at least some people
and implementations:

   The purpose of reserved characters is to provide a set of delimiting
   characters that are distinguishable from other data within a URI.
   URIs that differ in the replacement of a reserved character with its
   corresponding percent-encoded octet are not equivalent.  Percent-
   encoding a reserved character, or decoding a percent-encoded octet
   that corresponds to a reserved character, will change how the URI is
   interpreted by most applications.  *Thus, characters in the reserved
   set are protected from normalization and are therefore safe to be
   used by scheme-specific and producer-specific algorithms for
   delimiting data subcomponents within a URI.*

Whether you or I might argue that they "should" or "should not" be
equivalent from a URI producing perspective, there is some expectation that
reasonable people and existing implementations which we need to
inter-operate with might disagree, and the formal documentation is now
explicit in RFC 3986 that the characters in the reserved set are protected
from normalization and therefore safe to use by scheme-specific and
producer-specific algorithms for delimiting data subcomponents within a
URI. This isn't saying it "should" or "shouldn't" be done. It's saying that
it is recognized that it is being done, so any method of normalization
needs to be cautious.

My expectation is that an HTTP client that receives a URI from a
redirect or other external source should not automatically apply
unnecessary normalization that might change the meaning of the URI to the
producer. It means acknowledging that the producer is authoritative for
whether or not the reserved characters should be encoded for their use
case, and it is not permitted for Apache HTTP Client to transform the URI
to mean something different in this case.

For example, according to RFC 3986 I would expect:

    http://acme.com/foo&bar     to be normalized to:
    http://acme.com/foo&bar

And:

    http://acme.com/foo%26bar     to be normalized to:
    http://acme.com/foo%26bar

Does the proposed code result in this expectation being met? Or does it
always percent-encode, leading to the opposite problem, that normalization
is now potentially breaking applications that presume the '&' will be left
intact in the path segment?
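
To make the expectation concrete, here is a minimal sketch of
percent-encoding normalization along the lines of RFC 3986 sections
6.2.2.1 and 6.2.2.2: hex digits in percent-encoded octets are
uppercased, and only octets that correspond to unreserved characters are
decoded. Reserved characters are never encoded or decoded. It is written
in Python, the function name is hypothetical, and it says nothing about
what the proposed HttpCore change actually does.

```python
import re

# Unreserved characters per RFC 3986, section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_PCT_OCTET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(uri: str) -> str:
    """Normalize percent-encoded octets without touching reserved characters.

    Hypothetical helper for illustration only.
    """
    def fix(match):
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            # A percent-encoded unreserved character is equivalent
            # to the literal character, so decode it.
            return ch
        # Anything else stays encoded; only the hex case is normalized.
        return "%" + match.group(1).upper()

    # Literal characters, including a raw '&', are left exactly as found.
    return _PCT_OCTET.sub(fix, uri)
```

Under this sketch both URIs above come back unchanged, "/%6d%61%6d%61/"
becomes "/mama/", and "/a%2fb" becomes "/a%2Fb".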

> I still find RFC 3986 terribly inconsistent and confusing but I suppose
> I am just not smart enough for it.
>

I wonder if URI producing and URI normalization are being conflated, and
this is the crucial point to resolving the confusion from both perspectives?

-- 
Mark Mielke <ma...@gmail.com>

Re: RFC3986 conformance saga

Posted by Oleg Kalnichevski <ol...@apache.org>.

On Fri, 2020-07-10 at 11:04 +0200, Oleg Kalnichevski wrote:
> Folks
> 
> Some of you might have followed the RFC3986 Soap Opera in JIRA (with a
> very special guest moreover), with interest, or not.
> 
> Anyway, here's what I can propose as an immediate step. I will invest
> time into making _our_ code and protocol logic conformant to RFC3986
> (Roy's interpretation of it, that is) and put it into 5.1.x branch for
> review.
> 
> We will end up with an RFC3986 / RFC2396 hybrid, at least initially. At
> a later point someone else would be very welcome to work on
> HTTPCORE-637
> 
> https://issues.apache.org/jira/browse/HTTPCORE-637
> 
> Oleg
> 
> 

Folks.

Please find a few minutes to review the changes I am proposing to make
HttpCore partially RFC3986 conformant in the 5.1 branch:  

https://github.com/apache/httpcomponents-core/pull/205

There are two major differences to the behavior of HttpCore 5.0.x and
HttpClient 4.5.x:

1. percent-encoding is applied to all unreserved characters whether or
not some of those characters are explicitly permitted for use in URI
components, for instance `&` character in path segments.

2. The host in the URI authority component can contain characters beyond
those permitted by DNS syntax. The host can now be empty.
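
As an illustration of the second point, the generic URI regular
expression from RFC 3986, Appendix B, already distinguishes an absent
authority from an empty one and does not restrict the host to DNS
syntax. The sketch below is in Python, the function name is
hypothetical, and it extracts the whole authority (which may also carry
userinfo and a port), not just the host. It is not how HttpCore parses
URIs.

```python
import re

# Generic URI-reference regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_authority(uri: str):
    """Return the authority component of a URI.

    Returns None when the URI has no authority at all and an empty
    string when the authority is present but empty.
    Hypothetical helper for illustration only.
    """
    match = URI_RE.match(uri)
    # Group 4 is the authority; it is None when "//" does not appear.
    return match.group(4)
```

For example, "file:///etc/hosts" has an empty authority, "mailto:a@b.c"
has none, and "http://my_host/x" yields "my_host" even though '_' is not
valid in a DNS host name.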

I still find RFC 3986 terribly inconsistent and confusing but I suppose
I am just not smart enough for it. 

So, please, your feedback _really_ matters.

Cheers

Oleg


Re: RFC3986 conformance saga

Posted by "info@flyingfischer.ch" <in...@flyingfischer.ch>.

Thanks Oleg for your great work, time and efforts in this project!

Markus

On 10.07.20 at 11:04, Oleg Kalnichevski wrote:
> Folks
>
> Some of you might have followed the RFC3986 Soap Opera in JIRA (with a
> very special guest moreover), with interest, or not.
>
> Anyway, here's what I can propose as an immediate step. I will invest
> time into making _our_ code and protocol logic conformant to RFC3986
> (Roy's interpretation of it, that is) and put it into 5.1.x branch for
> review.
>
> We will end up with an RFC3986 / RFC2396 hybrid, at least initially. At
> a later point someone else would be very welcome to work on
> HTTPCORE-637
>
> https://issues.apache.org/jira/browse/HTTPCORE-637
>
> Oleg

