Posted to httpclient-users@hc.apache.org by Eugeny N Dzhurinsky <bo...@redwerk.com> on 2009/03/11 22:09:09 UTC
Weird issue with '+' symbols in path?
Hello there!
I've recently stumbled over a weird issue with '+' symbols when using Commons
HttpClient 3.1. The unit test below illustrates the problem:
========================================================================================
import junit.framework.TestCase;

import org.apache.commons.httpclient.URI;

/**
 * Tests the escaping issue in URI class
 */
public class TestURIEscaping extends TestCase {

    private static final String SAMPLE_URI = "http://www.fulltiltpoker.com/hu/pro-chat-transcript/Chau+Giang/1233971932";

    public void testURIEscaping() throws Exception {
        URI uri = new URI(
                SAMPLE_URI,
                false, "latin1");
        assertEquals(SAMPLE_URI, uri.toString());
    }

}
========================================================================================
Surprisingly, the test fails!
After digging into the code of the URI class, I noticed the following
strange code:
    /**
     * Those characters that are allowed for the abs_path.
     */
    public static final BitSet allowed_abs_path = new BitSet(256);
    // Static initializer for allowed_abs_path
    static {
        allowed_abs_path.or(abs_path);
        // allowed_abs_path.set('/'); // aleady included
        allowed_abs_path.andNot(percent);
        allowed_abs_path.clear('+');
    }
and it looks like the '+' character is always replaced with its hex code, %2B.
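To make the effect of that cleared bit concrete, here is a minimal sketch of a BitSet-driven path encoder. It is not the Commons HttpClient source: the class name PlusEscapingSketch, the helper pathChars, and the exact character list are assumptions made for illustration. It only shows the general technique, where every byte missing from the allowed set is written as %XX.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical illustration only; not the Commons HttpClient implementation.
class PlusEscapingSketch {

    // Builds the set of characters that may appear unescaped in a path.
    // The character list is an assumption loosely based on RFC 2396 pchar.
    static BitSet pathChars(boolean allowPlus) {
        BitSet allowed = new BitSet(256);
        for (char c = 'a'; c <= 'z'; c++) allowed.set(c);
        for (char c = 'A'; c <= 'Z'; c++) allowed.set(c);
        for (char c = '0'; c <= '9'; c++) allowed.set(c);
        for (char c : "-_.!~*'()/:@&=$,;".toCharArray()) allowed.set(c);
        if (allowPlus) {
            allowed.set('+');
        }
        return allowed;
    }

    // Percent-encodes every byte that is not in the allowed set.
    static String encode(String path, BitSet allowed) {
        StringBuilder out = new StringBuilder();
        for (byte b : path.getBytes(StandardCharsets.ISO_8859_1)) {
            int v = b & 0xFF;
            if (allowed.get(v)) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String path = "/hu/pro-chat-transcript/Chau+Giang/1233971932";
        // '+' cleared from the set: the path comes out with %2B
        System.out.println(encode(path, pathChars(false)));
        // '+' kept in the set: the path comes out unchanged
        System.out.println(encode(path, pathChars(true)));
    }
}
```

With '+' removed from the allowed set, the sample path is emitted with Chau%2BGiang, which matches the behaviour the failing test points at.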
But as far as I remember, there is RFC 2396
http://www.ietf.org/rfc/rfc2396.txt
and in section
3.3. Path Component

   The path component contains data, specific to the authority (or the
   scheme if there is no authority component), identifying the resource
   within the scope of that scheme and authority.

      path          = [ abs_path | opaque_part ]

      path_segments = segment *( "/" segment )
      segment       = *pchar *( ";" param )
      param         = *pchar

      pchar         = unreserved | escaped |
                      ":" | "@" | "&" | "=" | "+" | "$" | ","

   The path may consist of a sequence of path segments separated by a
   single slash "/" character.  Within a path segment, the characters
   "/", ";", "=", and "?" are reserved.  Each path segment may include a
   sequence of parameters, indicated by the semicolon ";" character.
   The parameters are not significant to the parsing of relative
   references.
So, as per this section, the '+' character should never be escaped! So why does
HttpClient violate this RFC? Or did I misunderstand something?
Thank you in advance!
--
Eugene N Dzhurinsky
Re: Weird issue with '+' symbols in path?
Posted by Oleg Kalnichevski <ol...@apache.org>.
There will be no fixes in HttpClient 3.x except for critical security
bugs. Consider migrating to HttpClient 4.0.
Oleg
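
For anyone following the migration advice: request classes in HttpClient 4.x, such as HttpGet, accept a java.net.URI, so the JDK class's treatment of '+' is what matters there. Below is a minimal sketch using only the JDK (the class name JdkUriPlusSketch and its helper methods are hypothetical) that checks whether the sample URI from the original test survives a round trip.

```java
import java.net.URI;

// Hypothetical illustration using only java.net.URI from the JDK.
class JdkUriPlusSketch {

    // Parses the string and renders it back, as the original unit test did.
    static String roundTrip(String uri) {
        return URI.create(uri).toString();
    }

    // Returns the path exactly as it appears in the URI, without decoding.
    static String rawPath(String uri) {
        return URI.create(uri).getRawPath();
    }

    public static void main(String[] args) {
        String sample =
            "http://www.fulltiltpoker.com/hu/pro-chat-transcript/Chau+Giang/1233971932";
        // true: the '+' in the path is not rewritten to %2B
        System.out.println(sample.equals(roundTrip(sample)));
        // /hu/pro-chat-transcript/Chau+Giang/1233971932
        System.out.println(rawPath(sample));
    }
}
```

This only demonstrates parsing and rendering on the client side; how a given server interprets '+' in a path is a separate question.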