Posted to users@tomcat.apache.org by Mindaugas Žakšauskas <mi...@gmail.com> on 2011/05/09 12:32:02 UTC

Semicolon URI encoding and RFC

Hi,

I was trying to read RFCs 3986 and 2396 to understand some subtleties
about URI encoding.

In particular, I am interested in whether a semicolon (;) needs to be
percent-escaped, e.g. http://site/some;path vs.
http://site/some%3Bpath, when outputting e.g. an HTML href attribute.

Just for interest, here's what I get in both Tomcat 6.0.26.0 and 7.0.12.0:

href URI                       ((HttpServletRequest) request).getServletPath()

http://site/foo                /foo
http://site/test1;test2        /test1
http://site/test1%3Btest2      /test1;test2
http://site/test1)test2/       /test1)test2/

According to RFC 3986, both the semicolon and the closing bracket ')'
belong to the sub-delims class, but one needs escaping and the other
doesn't. Is this expected behaviour? I have asked this question on
StackOverflow, and the answerer guessed that Tomcat is following the
older RFC 2396. Can anyone clarify?

Regards,
Mindaugas

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: Semicolon URI encoding and RFC

Posted by Konstantin Kolinko <kn...@gmail.com>.
2011/5/11 Mindaugas Žakšauskas <mi...@gmail.com>:
> On Tue, May 10, 2011 at 7:31 PM, Christopher Schultz
> <ch...@christopherschultz.net> wrote:
>> <..>
>> What about http://site/test1%28test2/
>>
>> Does that give you "/test1)test2/"?
>
> Closing bracket is %29 but yes, it does.
>
>> If so, Tomcat is probably following SOP with regard to standards which
>> is to be conservative in what you send and liberal in what you accept.
>
> Which is all good and understandable. But I would like to print <a
> href="/test1)test2"> rather than <a href="/test1%29test2"> for better
> readability as now I don't know what requires to be encoded and what
> not.

readability? nobody reads the HTML source

That is why wikis and the like use lightweight textual markup languages:

HTML is too hard for the average person to read.


Best regards,
Konstantin Kolinko



Re: Semicolon URI encoding and RFC

Posted by Mindaugas Žakšauskas <mi...@gmail.com>.
Christopher Schultz> So, you want to /only/ escape those entities that
Christopher Schultz> are /absolutely required/ to be escaped?

Yes.

Christopher Schultz> I'm not sure anyone really cares what URLs look
Christopher Schultz> like, do they?
Konstantin Kolinko> readability? nobody reads the HTML source

Search engine bots do [1]. Some people (especially those who care
about SEO) do. This code is written for a CMS and I want to make sure
I won't get any customers coming back to me in a year asking to fix it
again - after me spending days reading RFCs and asking people on the
forums.

Christopher Schultz> And if they do, why not change them so this escaping thing
Christopher Schultz> isn't necessary?

I can't. This is a CMS where people can create pages having arbitrary
characters. And the system needs to print links to these pages in, e.g.,
a pre-packaged site map component.

Regards,
Mindaugas

1. http://lmgtfy.com/?q=search+friendly+uri



Re: Semicolon URI encoding and RFC

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Mindaugas,

On 5/11/2011 9:16 AM, Mindaugas Žakšauskas wrote:
> On Tue, May 10, 2011 at 7:31 PM, Christopher Schultz
> <ch...@christopherschultz.net> wrote:
>> <..>
>> What about http://site/test1%28test2/
>>
>> Does that give you "/test1)test2/"?
> 
> Closing bracket is %29 but yes, it does.
> 
>> If so, Tomcat is probably following SOP with regard to standards which
>> is to be conservative in what you send and liberal in what you accept.
> 
> Which is all good and understandable. But I would like to print <a
> href="/test1)test2"> rather than <a href="/test1%29test2"> for better
> readability as now I don't know what requires to be encoded and what
> not.

So, you want to /only/ escape those entities that are /absolutely
required/ to be escaped? I'm not sure anyone really cares what URLs look
like, do they? And if they do, why not change them so this escaping
thing isn't necessary?

-chris



Re: Semicolon URI encoding and RFC

Posted by Mindaugas Žakšauskas <mi...@gmail.com>.
On Tue, May 10, 2011 at 7:31 PM, Christopher Schultz
<ch...@christopherschultz.net> wrote:
> <..>
> What about http://site/test1%28test2/
>
> Does that give you "/test1)test2/"?

Closing bracket is %29 but yes, it does.

> If so, Tomcat is probably following SOP with regard to standards which
> is to be conservative in what you send and liberal in what you accept.

Which is all good and understandable. But I would like to print <a
href="/test1)test2"> rather than <a href="/test1%29test2"> for better
readability as now I don't know what requires to be encoded and what
not.

Regards,
Mindaugas



Re: Semicolon URI encoding and RFC

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Mindaugas,

On 5/9/2011 6:32 AM, Mindaugas Žakšauskas wrote:
> http://site/test1)test2/            /test1)test2/

What about http://site/test1%28test2/

Does that give you "/test1)test2/"?

If not, something is probably wrong.

If so, Tomcat is probably following SOP with regard to standards which
is to be conservative in what you send and liberal in what you accept.

-chris



Re: Semicolon URI encoding and RFC

Posted by Mindaugas Žakšauskas <mi...@gmail.com>.
Hi,

Thanks very much for your answers. Just for reference, I will sum up
what I've managed to get out of this discussion. Please correct me if
I am wrong.

My problem wasn't a charset incompatibility between client and server, as
the same party both produces and consumes the URLs (and yes,
we do use UTF-8 everywhere and have useBodyEncodingForURI set to
true). Anyway, it was an interesting read to get the whole picture,
including Punycode. I hope others benefited from this, too.

What I wanted to clarify was the exact sets of characters needing %
encoding. Initially I thought that this all boils down to different
character classes but it turned out to be incorrect (the semicolon VS
bracket case).

My other concern was internationalized paths, and it was good advice
from Konstantin to have a look at Wikipedia. For example, a link to
"botánico" in Spanish Wikipedia is printed as <a
href="/wiki/Bot%C3%A1nica" title="Botánica"> and browsers seem to
be able to show it percent-decoded without any special effort. I only
slipped here because initially I used [1], which does not encode
(at least) some characters correctly. I ended up using a modified copy
of java.net.URI::appendEncoded(StringBuilder, char), as the original is
private and java.net.URI does not escape semicolons [2].

My conclusion is to percent-encode everything that is not unreserved.
It might be sub-optimal as some characters, such as brackets, do not
need encoding, but I would rather be safe than sorry.

[1] http://stackoverflow.com/questions/573184/java-convert-string-to-valid-uri-object/3332864#3332864
[2] The final code that does the escaping:

    // Requires: import java.nio.ByteBuffer;
    //           import java.nio.charset.StandardCharsets;

    private static final String UNRESERVED =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~";

    private static final char[] hexDigits = {'0', '1', '2', '3', '4', '5', '6', '7',
            '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};

    // Adapted from java.net.URI and modified to ensure semicolons, etc. get encoded.
    // StandardCharsets.UTF_8 stands in for the JDK-internal ThreadLocalCoders, so
    // there is no CharacterCodingException to handle. Caveat: a lone surrogate char
    // (half of a character outside the BMP) comes out as %3F ('?').
    private static void appendEncoded(StringBuilder sb, char c) {
        ByteBuffer bb = StandardCharsets.UTF_8.encode(String.valueOf(c));
        while (bb.hasRemaining()) {
            int b = bb.get() & 0xff;
            sb.append('%');
            sb.append(hexDigits[(b >> 4) & 0x0f]);
            sb.append(hexDigits[b & 0x0f]);
        }
    }

    // To escape, iterate over all characters: append those for which
    // isUnreserved(c) is true as-is and pass the rest to appendEncoded(sb, c).
    private static boolean isUnreserved(char c) {
        return UNRESERVED.indexOf(c) != -1;
    }
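
For illustration, the "iterate over all characters" step could look roughly like the sketch below. It is not the code from the CMS: the class name SegmentEncoder and the method encodeSegment are made up for this example, and it works on the UTF-8 bytes of the whole string rather than char by char, so that characters outside the BMP are encoded correctly. It is meant for a single path segment, so a '/' in the input is encoded as well.

```java
import java.nio.charset.StandardCharsets;

final class SegmentEncoder {
    // RFC 3986 unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    private static final String UNRESERVED =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Hypothetical helper: percent-encodes every UTF-8 byte of the segment
    // that is not an unreserved ASCII character.
    static String encodeSegment(String segment) {
        StringBuilder sb = new StringBuilder();
        for (byte raw : segment.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xff;
            if (b < 0x80 && UNRESERVED.indexOf((char) b) != -1) {
                sb.append((char) b);
            } else {
                sb.append('%').append(HEX[b >> 4]).append(HEX[b & 0x0f]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeSegment("some;path"));      // some%3Bpath
        System.out.println(encodeSegment("test1)test2"));    // test1%29test2
        System.out.println(encodeSegment("Bot\u00e1nica"));  // Bot%C3%A1nica
    }
}
```

The price of this "safe" choice is that sub-delims such as ')' are encoded even though a server would accept them unescaped.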

Regards,
Mindaugas



Re: Semicolon URI encoding and RFC

Posted by André Warnier <aw...@ice-sa.com>.
Hi.

This whole question is a pain in the a.. , and I personally do not understand how a 
million marketing people can be talking of "web 2.0" and "web 3.0", but not have been able 
to come out with HTTP 2.0 where URLs (and everything else) would be by default 
Unicode/UTF-8 instead of ASCII and/or ISO-latin-1.

But things being what they are, to answer your question to the best of my abilities, and 
trying to avoid jargon and twisted language :
- Basically, a URL "in transit" between a client and a server, should contain only *bytes* 
with individual byte values between 0 and 127 decimal.
Thus when it is about to send a URL to a server, any client should examine the URL 
byte-by-byte, and if any of these bytes would be outside the 0-127 range, it should 
replace it by a 3-byte sequence %xy, where xy is the hexadecimal representation of the 
byte value.
And then there are some additional rules for some of the bytes 0-127, which either forbid 
them in a URL, or also specify that you have to encode them with the %xy logic, or 
differently (like a space encoded as a "+", and a "+" encoded as %xy), and/or when (as 
Konstantin explains below for the ";").

At the server side, the first thing which the server should do with this URL, is to make 
the inverse translation : examine the URL and replace any %xy sequence by the single byte 
value which this sequence represented in transit (and "+" by space).

And /then/ starts the circus.

Because there is nothing in the RFCs that would enable the server to know, after this 
URL-decoding, in which character set the client expressed this URL.
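
To make the ambiguity concrete, here is a small sketch in which the same percent-encoded bytes yield different text depending on the charset the receiving side assumes. The class and method names are illustrative only. It uses java.net.URLDecoder, which is meant for form data and also turns '+' into a space; the input here contains no '+', so that does not matter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class CharsetGuess {
    // Hypothetical helper: percent-decodes the input, interpreting the
    // resulting bytes in the given charset.
    static String decodeAs(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // The two bytes C3 A1 are the UTF-8 form of U+00E1 (a with acute).
        String inTransit = "Bot%C3%A1nica";
        String asUtf8 = decodeAs(inTransit, "UTF-8");
        String asLatin1 = decodeAs(inTransit, "ISO-8859-1");
        System.out.println(asUtf8.length());    // 8 characters
        System.out.println(asLatin1.length());  // 9 characters: each byte became its own character
    }
}
```

Nothing in the request tells the server which of the two readings the client intended.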

So basically, the interpretation of at least part of the URL falls to the server-side 
application, and the client is supposed to send "the right thing" so that the application 
does not get confused. And there is no real way for the server to force the client to do 
the right thing.
And if either side does not respect whatever convention they have between them, one of the 
sides will get confused.

To my knowledge, there exists no Internet RFC which contradicts what I am writing above.
It is a definite hole in the specs, and one which nowadays is costing a lot of time being 
lost in confusion and half-way patching attempts (*).
I can understand that when HTTP 1.0 was first defined 15 years ago now, this was a 
perfectly valid position to take.  But I personally do not understand why nowadays, 15 
years and 100 million worldwide webservers later, and now that Unicode/UTF-8 support is 
ubiquitous, we are still at the same point.


(*) such as IE's "always send URLs as UTF-8", and Tomcat's "useBodyEncodingForURI" hacks.




Mindaugas Žakšauskas wrote:
> On Mon, May 9, 2011 at 2:03 PM, Konstantin Kolinko
> <kn...@gmail.com> wrote:
> <..>
>> If ";" is part of the actual path, it must be escaped.
>>
>> If ";" starts a "path parameter" it must be unescaped. One well-known
>> example is ";jsessionid" path parameter.
> 
> Thanks for your answer. Is this rule is just "de facto" rule, or is it
> documented anywhere in RFC3986/RFC2396?
> 
> Extending my question, is there a clear criteria which would define
> which characters always need escaping and which don't? At the moment I
> am escaping everything that is not unreserved [1], but I am not sure
> about SEOability and user-friendliness - this especially concerns path
> with international characters in URLs, e.g. http://site/pathąčęė
> 
> I have also found a similar Tomcat bug [2], but it is addressing
> slightly different issue.
> 
> If anyone wants to benefit this, I have just added 50 bonus points to
> my SO question [3]. The main question I want to get answer for is -
> which characters can and which need escaping, both in terms of RFC and
> Tomcat.
> 
> Regards,
> Mindaugas
> 
> 1. According to RFC 3986, unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
> 2. https://issues.apache.org/bugzilla/show_bug.cgi?id=51132
> 3. http://stackoverflow.com/questions/5913623/rfc3986-which-pchars-need-to-be-percent-encoded
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
> For additional commands, e-mail: users-help@tomcat.apache.org
> 
> 




Re: Semicolon URI encoding and RFC

Posted by André Warnier <aw...@ice-sa.com>.
Caldarale, Charles R wrote:
>> From: André Warnier [mailto:aw@ice-sa.com] 
>> Subject: Re: Semicolon URI encoding and RFC
> 
>> The "site" (or hostname) part of the URL is submitted to a different 
>> encoding than the path part (/pathąčęė).  The path part must be URL-
>> encoded, but for the hostname part, what is used is "punycode", see 
>> http://en.wikipedia.org/wiki/Punycode.
>> Just another example of the current mess with character sets and encodings...
> 
> My brain 'urts.
> 
> http://www.youtube.com/watch?v=tqyxXX3Ra4A
> 
Yep. And on the same page, there's another one maybe more to the point : prejudice.
Any (other) Belgians on this forum ?
:-)




RE: Semicolon URI encoding and RFC

Posted by "Caldarale, Charles R" <Ch...@unisys.com>.
> From: André Warnier [mailto:aw@ice-sa.com] 
> Subject: Re: Semicolon URI encoding and RFC

> The "site" (or hostname) part of the URL is submitted to a different 
> encoding than the path part (/pathąčęė).  The path part must be URL-
> encoded, but for the hostname part, what is used is "punycode", see 
> http://en.wikipedia.org/wiki/Punycode.
> Just another example of the current mess with character sets and encodings...

My brain 'urts.

http://www.youtube.com/watch?v=tqyxXX3Ra4A

 - Chuck




Re: Semicolon URI encoding and RFC

Posted by André Warnier <aw...@ice-sa.com>.
Konstantin Kolinko wrote:
..
> 
> 2011/5/9 André Warnier <aw...@ice-sa.com>:
>> (like a space encoded as a "+", and a "+"
>> encoded as %xy),
> 
> Andre, one small correction:
> It sometimes causes confusion, but encoding of space as '+' works only
> in the query part of the URL.
> The unambiguous way to encode a space regardless of is position in URL is %20.
> 
> Encoding space as '+' is defined by "url encoding" encoding scheme
> defined by HTML standard, in the chapter where it describes how HTML
> forms are submitted.
> 
Agreed, my mistake.
Also, in the query string part, an unencoded ";" could be taken as a query parameter 
separator, no ?  (an alternative to "&").
But I forget what RFC that is, if any.

Now one additional comment. You said :
..
 > about SEOability and user-friendliness - this especially concerns path
 > > with international characters in URLs, e.g. http://site/pathąčęė

That is up to the browser how to show those URLs. Many browsers have a
setting how to display such URLs.  E.g. try to browse non-English
Wikipedia for an example of i18n addresses.
..

I think that the above is a bit confusing.
The "site" (or hostname) part of the URL is submitted to a different encoding than the 
path part (/pathąčęė).  The path part must be URL-encoded, but for the hostname part, what 
is used is "punycode", see http://en.wikipedia.org/wiki/Punycode.
Just another example of the current mess with character sets and encodings...
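
As a sketch of the difference: in Java the hostname conversion is available through java.net.IDN, while the path is percent-encoded separately. The class name HostEncoding and the host "bücher.example" are just examples chosen for this illustration.

```java
import java.net.IDN;

final class HostEncoding {
    // Converts an internationalized hostname to its ASCII (Punycode-based) form.
    static String toAsciiHost(String host) {
        return IDN.toASCII(host);
    }

    // Converts the ASCII form back to Unicode for display.
    static String toUnicodeHost(String asciiHost) {
        return IDN.toUnicode(asciiHost);
    }

    public static void main(String[] args) {
        String ascii = toAsciiHost("b\u00fccher.example");
        System.out.println(ascii);  // xn--bcher-kva.example
        // The round trip restores the original name.
        System.out.println(toUnicodeHost(ascii).equals("b\u00fccher.example"));  // true
    }
}
```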

I guess one has to have a first or last name containing so-called "diacritic" characters 
to really appreciate these issues.




Re: Semicolon URI encoding and RFC

Posted by Konstantin Kolinko <kn...@gmail.com>.
2011/5/9 Mindaugas Žakšauskas <mi...@gmail.com>:
> On Mon, May 9, 2011 at 2:03 PM, Konstantin Kolinko
> <kn...@gmail.com> wrote:
> <..>
>> If ";" is part of the actual path, it must be escaped.
>>
>> If ";" starts a "path parameter" it must be unescaped. One well-known
>> example is ";jsessionid" path parameter.
>
> Thanks for your answer. Is this rule is just "de facto" rule, or is it
> documented anywhere in RFC3986/RFC2396?

As you wrote, it is RFC 3986, per [1]
http://tools.ietf.org/html/rfc3986

> Extending my question, is there a clear criteria which would define
> which characters always need escaping and which don't? At the moment I
> am escaping everything that is not unreserved [1], but I am not sure
> about SEOability and user-friendliness - this especially concerns path
> with international characters in URLs, e.g. http://site/pathąčęė

That is up to the browser how to show those URLs. Many browsers have a
setting how to display such URLs.  E.g. try to browse non-English
Wikipedia for an example of i18n addresses.

> I have also found a similar Tomcat bug [2], but it is addressing
> slightly different issue.

[2] is not a bug. It is an invalid report. It is useful reading, though.

> If anyone wants to benefit this, I have just added 50 bonus points to
> my SO question [3]. The main question I want to get answer for is -
> which characters can and which need escaping, both in terms of RFC and
> Tomcat.

> 1. According to RFC 3986, unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
> 2. https://issues.apache.org/bugzilla/show_bug.cgi?id=51132
> 3. http://stackoverflow.com/questions/5913623/rfc3986-which-pchars-need-to-be-percent-encoded

BTW, take a look at the java.net.URI class and its URI.toString() and
URI.toURL() methods.

Just one example (not 100% related to your case, but one that happens
frequently):
to convert a File to a proper URL, the correct code is to call

File.toURI().toURL()

because that takes care of % encodings, while the old File.toURL()
method does not.
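
A minimal sketch of that conversion follows. The class name, the helper name and the file name "my file.txt" are made up for the example; the file does not need to exist.

```java
import java.io.File;
import java.net.MalformedURLException;

final class FileUrlDemo {
    // Goes through File.toURI() first so that characters such as spaces
    // are percent-encoded in the resulting URL.
    static String toUrlString(File file) {
        try {
            return file.toURI().toURL().toString();
        } catch (MalformedURLException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The relative name is resolved against the current directory,
        // so only the tail of the output is predictable.
        String url = toUrlString(new File("my file.txt"));
        System.out.println(url);
        System.out.println(url.endsWith("my%20file.txt"));  // true
    }
}
```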


2011/5/9 André Warnier <aw...@ice-sa.com>:
> (like a space encoded as a "+", and a "+"
> encoded as %xy),

Andre, one small correction:
It sometimes causes confusion, but encoding of space as '+' works only
in the query part of the URL.
The unambiguous way to encode a space regardless of its position in URL is %20.

Encoding space as '+' is defined by "url encoding" encoding scheme
defined by HTML standard, in the chapter where it describes how HTML
forms are submitted.
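
The two schemes can be compared side by side with standard JDK classes. In this sketch the class and method names are illustrative. java.net.URLEncoder implements the HTML form encoding, while the multi-argument java.net.URI constructor quotes only characters that are illegal in a path, which is also why it leaves ';' alone.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

final class SpaceEncoding {
    // HTML form encoding (application/x-www-form-urlencoded): space becomes '+'.
    static String formEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Path quoting as done by java.net.URI: space becomes %20.
    static String pathEncode(String path) {
        try {
            return new URI(null, null, path, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("a b+c"));       // a+b%2Bc
        System.out.println(pathEncode("/a b+c"));      // /a%20b+c
        System.out.println(pathEncode("/some;path"));  // /some;path
    }
}
```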


Best regards,
Konstantin Kolinko



Re: Semicolon URI encoding and RFC

Posted by Mindaugas Žakšauskas <mi...@gmail.com>.
On Mon, May 9, 2011 at 2:03 PM, Konstantin Kolinko
<kn...@gmail.com> wrote:
<..>
> If ";" is part of the actual path, it must be escaped.
>
> If ";" starts a "path parameter" it must be unescaped. One well-known
> example is ";jsessionid" path parameter.

Thanks for your answer. Is this just a "de facto" rule, or is it
documented anywhere in RFC3986/RFC2396?

Extending my question, are there clear criteria that define
which characters always need escaping and which don't? At the moment I
am escaping everything that is not unreserved [1], but I am not sure
about SEOability and user-friendliness - this especially concerns paths
with international characters in URLs, e.g. http://site/pathąčęė

I have also found a similar Tomcat bug [2], but it is addressing
slightly different issue.

If anyone wants to benefit from this, I have just added 50 bonus points
to my SO question [3]. The main question I want answered is: which
characters may be escaped and which must be, both in terms of the RFCs
and Tomcat.

Regards,
Mindaugas

1. According to RFC 3986, unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
2. https://issues.apache.org/bugzilla/show_bug.cgi?id=51132
3. http://stackoverflow.com/questions/5913623/rfc3986-which-pchars-need-to-be-percent-encoded



Re: Semicolon URI encoding and RFC

Posted by Konstantin Kolinko <kn...@gmail.com>.
2011/5/9 Mindaugas Žakšauskas <mi...@gmail.com>:
> Hi,
>
> I was trying to read RFCs 3986 and 2396 to understand some subtleties
> about URI encoding.
>
> In particular I am interested about whether semicolon (;) needs to be
> percent escaped as, e.g. http://site/some;path VS
> http://site/some%3Bpath when outputting e.g. HTML href element.
>

If ";" is part of the actual path, it must be escaped.

If ";" starts a "path parameter" it must be unescaped. One well-known
example is ";jsessionid" path parameter.
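
The getServletPath() results reported in the first message are consistent with a server that first cuts off path parameters (everything from an unescaped ';' to the end of the segment) and only then percent-decodes what is left. The sketch below is a rough model of that order of operations, not Tomcat's actual code; all names in it are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class PathParamModel {
    // Removes ";params" from each segment of a raw (still percent-encoded) path.
    static String stripPathParameters(String rawPath) {
        StringBuilder sb = new StringBuilder();
        boolean skipping = false;
        for (int i = 0; i < rawPath.length(); i++) {
            char c = rawPath.charAt(i);
            if (c == '/') {
                skipping = false;   // a new segment starts
            } else if (c == ';') {
                skipping = true;    // the rest of this segment is a path parameter
            }
            if (!skipping) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Strips path parameters first, then percent-decodes as UTF-8.
    static String decodedPath(String rawPath) {
        String stripped = stripPathParameters(rawPath);
        try {
            // URLDecoder would turn '+' into a space, so protect it first.
            return URLDecoder.decode(stripped.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decodedPath("/test1;test2"));     // /test1
        System.out.println(decodedPath("/test1%3Btest2"));   // /test1;test2
        System.out.println(decodedPath("/test1)test2/"));    // /test1)test2/
    }
}
```

Because the ';' is interpreted before decoding, an escaped %3B survives as a literal semicolon in the path, while a raw ';' never reaches the application as part of it.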



Best regards,
Konstantin Kolinko
