Posted to dev@hc.apache.org by Mike Moran <mi...@mac.com> on 2003/01/29 21:10:40 UTC

Relative URIs strike again

I've been looking into an issue in some other (non-HttpClient) code 
regarding relative URIs, and I was wondering how HttpClient handles it. 
Specifically, it is the following case:

Base: http://www.foo.com/
Relative: #

Now, I've mostly seen this used as a `fake' URL for JavaScript popups, 
but nevertheless, it is not uncommon in the wild.

I've attached an example, hash.html, to elucidate. It contains three 
relative links: "" (nothing), "#" and "#anchor". Now, assuming a base 
ref of "file:///hash.html", this is what I find:

IE 5.0:
"file:///"
"file:///hash.html#"
"file:///hash.html#anchor"

Phoenix 0.5:
"file:///hash.html"
"file:///hash.html#"
"file:///hash.html#anchor"

My code:
"file:///hash.html"
"file:///hash.html"
"file:///hash.html#anchor"

I *suspect* from reading the HttpClient code, that it does the same as 
the last one, but I haven't got a working build here to test it.

I don't find the relevant RFC, RFC 2396 section 5.2, totally clear on 
what to do in the case of a fragment identifier just being "#". The 
regexp given in Appendix B seems to allow for it, i.e. the part (#(.*))? 
will match both "#" and "#anchor". I think the tricky bit that trips 
things up is the suggested way to reassemble the URI from its parts, which 
ignores the fragment entirely if it only consists of "#".
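
The Appendix B observation can be checked directly. What follows is a minimal sketch, not HttpClient code: it runs the RFC 2396 Appendix B expression through java.util.regex and reports the fragment group. The class and method names are made up for illustration. A null result stands for "undefined" (no "#" present) and an empty string for "defined but empty".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the class and method names are hypothetical, not part of HttpClient.
class AppendixBFragment {

    // The regular expression given in RFC 2396, Appendix B.
    private static final Pattern URI_REFERENCE = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Returns the fragment (group 9), or null when the reference has no "#".
    static String fragmentOf(String reference) {
        Matcher m = URI_REFERENCE.matcher(reference);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a URI reference: " + reference);
        }
        return m.group(9);
    }

    public static void main(String[] args) {
        // The three relative links from hash.html.
        for (String ref : new String[] {"", "#", "#anchor"}) {
            String fragment = fragmentOf(ref);
            System.out.println("\"" + ref + "\" -> "
                    + (fragment == null ? "undefined" : "\"" + fragment + "\""));
        }
    }
}
```

With this sketch, "" yields an undefined fragment, while "#" yields a fragment that is defined but empty, so the regexp itself does distinguish the two cases.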

So, what way does HttpClient's URI class deal with this? Are IE and 
Phoenix/Mozilla both wrong? Answers on a postcard to ...

-- 
Mike

Re: Relative URIs strike again

Posted by Sung-Gu <je...@apache.org>.

On further thought, I was confused...  :(
Now, your posts are helpful to me... ;)

Thanks to you guys...

----- Original Message ----- 
From: "Michael Becke" <be...@u.washington.edu>

> These cases are partially handled as of now.
> 
> <a href="">""</a>
> Does not work, but should.  This falls into the case of abnormal URIs 
> according to the RFC.  I think Armando was working on a fix for this.

What's this?  An empty path or an empty fragment?
Looks like an empty relative path to me.
It should return a URI with the base path,
because it refers to the path itself.

Moreover, the URI class handles relative URIs too... 
so I realized it's not enough to deal with only the URI(URI, String)
constructor...  :(   It should be handled in parseUriReference... 
 
Sung-Gu

Re: Relative URIs strike again

Posted by Michael Becke <be...@u.washington.edu>.

> In the case of "#" it would seem this has a defined fragment component
> which is empty (with all other components undefined).

I agree.  Fortunately this case seems to be handled correctly.  I'll 
send a patch that adds a few test cases in a minute.

> 
> Presumably then, for the above case of  "#", I should be comparing the 
> URIs using URI.getURIReference() of both the resolved base and the 
> expected absolute value ie not URI.toString() of each?

Yes, URI.getURIReference() should be used to test the full URI, 
including fragment.
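
To see why the choice of accessor matters in a test, here is a small stand-in model. It is not HttpClient's URI class: `ModelUri` and `fullReference()` are hypothetical names that only mimic the behaviour described in this thread, where toString() leaves the fragment out and a separate accessor returns the full reference.

```java
// Hypothetical stand-in, NOT org.apache.commons.httpclient.URI.
final class ModelUri {
    private final String withoutFragment;
    private final String fragment; // null = undefined, "" = defined but empty

    ModelUri(String reference) {
        int hash = reference.indexOf('#');
        if (hash < 0) {
            withoutFragment = reference;
            fragment = null;
        } else {
            withoutFragment = reference.substring(0, hash);
            fragment = reference.substring(hash + 1);
        }
    }

    // Mimics the fragment-less string form discussed in the thread.
    @Override
    public String toString() {
        return withoutFragment;
    }

    // Mimics a "full reference" accessor that keeps the fragment.
    String fullReference() {
        return fragment == null ? withoutFragment : withoutFragment + "#" + fragment;
    }

    public static void main(String[] args) {
        ModelUri plain = new ModelUri("http://www.foo.com/foop.html");
        ModelUri emptyFragment = new ModelUri("http://www.foo.com/foop.html#");
        // Comparing the fragment-less forms hides the difference: prints true.
        System.out.println(plain.toString().equals(emptyFragment.toString()));
        // Comparing the full references exposes it: prints false.
        System.out.println(plain.fullReference().equals(emptyFragment.fullReference()));
    }
}
```

A test that compares only the fragment-less form cannot tell "foop.html" from "foop.html#", which is why the full reference is the thing to assert on.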

Mike

Re: Relative URIs strike again

Posted by Mike Moran <mi...@mac.com>.

Michael Becke wrote:

>> But what exactly is the correct behaviour? At least two major 
>> browsers implement this behaviour in a different way than you would 
>> immediately expect. This could just be a bug in them, or it could be 
>> a legitimate ambiguity. To clarify, which of the following do you 
>> interpret as the correct behaviour?
>>
>> ["base" + "rel" = "abs"]
>> A:
>> "http://www.foo.com/foop.html" + "" = "http://www.foo.com/foop.html"
>> "http://www.foo.com/foop.html" + "#" = "http://www.foo.com/foop.html"
>>
>> B:
>> "http://www.foo.com/foop.html" + "" = "http://www.foo.com/foop.html"
>> "http://www.foo.com/foop.html" + "#" = "http://www.foo.com/foop.html#" 
>
[ ... ]

>>
>> HttpClient:
>> "http://www.foo.com/foop.html" + "" = "http://www.foo.com/foop.html"
>> "http://www.foo.com/foop.html" + "#" = "http://www.foo.com/foop.html"
>>
>> It would seem that HttpClient is doing A. Note that I just did a cvs 
>> update right now and didn't apply any patches, so I may have missed 
>> some code that is pending.
>
>
> This behavior looks correct. It seems that the empty case was fixed 
> yesterday.

[ ... ]

After further inspection of the rfc, it would seem B is actually 
correct. The relevant part is:

rfc2396, section 5.2
"7) ...

         if fragment is defined then
             append "#" to result
             append fragment to result

         return result

      Note that we must be careful to preserve the distinction between a
      component that is undefined, meaning that its separator was not
      present in the reference, and a component that is empty, meaning
      that the separator was present and was immediately followed by the
      next component separator or the end of the reference."

In the case of "#" it would seem this has a defined fragment component
which is empty (with all other components undefined).
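
The recomposition step quoted above can be transcribed almost line for line. This is only a sketch to make the undefined/empty distinction concrete, not HttpClient code; the class and method names are invented for the example, and null is used to represent an undefined component.

```java
// Illustrative only: a direct transcription of the recomposition step, not HttpClient code.
class Rfc2396Recompose {

    // null means "undefined" (separator absent); "" means "empty" (separator present).
    // The path is always defined, though it may be empty.
    static String recompose(String scheme, String authority, String path,
                            String query, String fragment) {
        StringBuilder result = new StringBuilder();
        if (scheme != null) {
            result.append(scheme).append(':');
        }
        if (authority != null) {
            result.append("//").append(authority);
        }
        result.append(path);
        if (query != null) {
            result.append('?').append(query);
        }
        if (fragment != null) {
            result.append('#').append(fragment);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Fragment taken from the reference "#": defined but empty.
        System.out.println(recompose("http", "www.foo.com", "/foop.html", null, ""));
        // Fragment taken from the reference "": undefined.
        System.out.println(recompose("http", "www.foo.com", "/foop.html", null, null));
    }
}
```

The first call yields "http://www.foo.com/foop.html#" and the second "http://www.foo.com/foop.html", which is behaviour B.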

>
>> Incidentally, when I tried a relative URI of "#foop" I got the 
>> following:
>>
>> "http://www.foo.com/foop.html" + "#foop" = 
>> "http://www.foo.com/foop.html"
>>
>> Surely this is incorrect?
>
>
> This is something I ran across as well.  The fragment is not dropped, 
> but is left out of the standard URI.  To get the full URI you have to 
> use URI.getURIReference(). 

Presumably then, for the above case of  "#", I should be comparing the 
URIs using URI.getURIReference() of both the resolved base and the 
expected absolute value ie not URI.toString() of each?

-- 
Mike

Re: Relative URIs strike again

Posted by Michael Becke <be...@u.washington.edu>.

> But what exactly is the correct behaviour? At least two major browsers 
> implement this behaviour in a different way than you would immediately 
> expect. This could just be a bug in them, or it could be a legitimate 
> ambiguity. To clarify, which of the following do you interpret as the 
> correct behaviour?
>
> ["base" + "rel" = "abs"]
> A:
> "http://www.foo.com/foop.html" + "" = "http://www.foo.com/foop.html"
> "http://www.foo.com/foop.html" + "#" = "http://www.foo.com/foop.html"
>
> B:
> "http://www.foo.com/foop.html" + "" = "http://www.foo.com/foop.html"
> "http://www.foo.com/foop.html" + "#" = "http://www.foo.com/foop.html#"

I believe A is correct.  The RFC is a little vague (particularly when 
there is a query) in this case as it just says to use "the current 
document".

> There are more actual possible permutations than A and B but they are 
> two likely groups of choices.
>
> I wrote a little snippet[1] to test the URI behaviour and got the 
> following:
>
> HttpClient:
> "http://www.foo.com/foop.html" + "" = "http://www.foo.com/foop.html"
> "http://www.foo.com/foop.html" + "#" = "http://www.foo.com/foop.html"
>
> It would seem that HttpClient is doing A. Note that I just did a cvs 
> update right now and didn't apply any patches, so I may have missed 
> some code that is pending.

This behavior looks correct. It seems that the empty case was fixed 
yesterday.

> Incidentally, when I tried a relative URI of "#foop" I got the 
> following:
>
> "http://www.foo.com/foop.html" + "#foop" = 
> "http://www.foo.com/foop.html"
>
> Surely this is incorrect?

This is something I ran across as well.  The fragment is not dropped, 
but is left out of the standard URI.  To get the full URI you have to 
use URI.getURIReference().

Mike

Re: Relative URIs strike again

Posted by Mike Moran <mi...@mac.com>.

On Wednesday, January 29, 2003, at 08:42 PM, Michael Becke wrote:

> These cases are partially handled as of now.
>
> <a href="">""</a>
> Does not work, but should.  This falls into the case of abnormal URIs 
> according to the RFC.  I think Armando was working on a fix for this.

Btw, here is an example of an empty URI seen more often:
  <FORM action="" method="post">
  ...form contents...
  </FORM>

This is actually quite useful if it is returned by a Servlet as it 
means "submit back to me".

> <a href="#">"#"</a>
> <a href="#anchor">"#anchor"</a>
> Both of these cases work correctly in the most recent code.

But what exactly is the correct behaviour? At least two major browsers 
implement this behaviour in a different way than you would immediately 
expect. This could just be a bug in them, or it could be a legitimate 
ambiguity. To clarify, which of the following do you interpret as the 
correct behaviour?

["base" + "rel" = "abs"]
A:
"http://www.foo.com/foop.html" + "" = "http://www.foo.com/foop.html"
"http://www.foo.com/foop.html" + "#" = "http://www.foo.com/foop.html"

B:
"http://www.foo.com/foop.html" + "" = "http://www.foo.com/foop.html"
"http://www.foo.com/foop.html" + "#" = "http://www.foo.com/foop.html#"

There are more actual possible permutations than A and B but they are 
two likely groups of choices.
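
For reference, the two groups can be written down as one function with a switch. This is a rough sketch that handles only the references discussed here (the empty reference and fragment-only references). `SameDocumentResolver`, `resolve`, and the `keepEmptyFragment` flag are hypothetical names, and dropping any fragment already on the base is an assumption of the sketch. It is not how HttpClient resolves URIs.

```java
// Illustrative only: hypothetical names, limited to "" and "#..." references.
class SameDocumentResolver {

    // keepEmptyFragment = false gives group A, true gives group B.
    static String resolve(String base, String rel, boolean keepEmptyFragment) {
        if (!rel.isEmpty() && rel.charAt(0) != '#') {
            throw new IllegalArgumentException("only \"\" and \"#...\" are handled: " + rel);
        }
        // Assumption: a fragment on the base is not carried over to the result.
        int hash = base.indexOf('#');
        String baseNoFragment = hash < 0 ? base : base.substring(0, hash);
        if (rel.isEmpty()) {
            return baseNoFragment;       // "": fragment undefined
        }
        if (rel.length() == 1 && !keepEmptyFragment) {
            return baseNoFragment;       // "#" under A: empty fragment dropped
        }
        return baseNoFragment + rel;     // "#" under B, or "#anchor" under both
    }

    public static void main(String[] args) {
        String base = "http://www.foo.com/foop.html";
        for (String rel : new String[] {"", "#", "#foop"}) {
            System.out.println("A: \"" + base + "\" + \"" + rel + "\" = \""
                    + resolve(base, rel, false) + "\"");
            System.out.println("B: \"" + base + "\" + \"" + rel + "\" = \""
                    + resolve(base, rel, true) + "\"");
        }
    }
}
```

The two groups differ only for "#"; both keep a non-empty fragment such as "#foop".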

I wrote a little snippet[1] to test the URI behaviour and got the 
following:

HttpClient:
"http://www.foo.com/foop.html" + "" = "http://www.foo.com/foop.html"
"http://www.foo.com/foop.html" + "#" = "http://www.foo.com/foop.html"

It would seem that HttpClient is doing A. Note that I just did a cvs 
update right now and didn't apply any patches, so I may have missed 
some code that is pending.

Incidentally, when I tried a relative URI of "#foop" I got the 
following:

"http://www.foo.com/foop.html" + "#foop" = 
"http://www.foo.com/foop.html"

Surely this is incorrect?

This may be due to the fact that the URI class lops off any fragment 
when you construct it, e.g.
(new URI("http://www.foo.com/foop.html#foop")).toString()
= http://www.foo.com/foop.html
I wrote a small test case for this[2].

Thanks,

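[1]: Snippet:
import org.apache.commons.httpclient.URI;

public class URIMain
{
     // Usage: java URIMain <base> [<rel>]
     public static void main(String[] args) throws Exception
     {
         String base = args[0];
         // With no second argument, resolve the empty relative reference "".
         String rel = "";
         if (args.length >= 2) {
             rel = args[1];
         }
         URI baseURI = new URI(base);
         // Resolve rel against the base.
         URI uri = new URI(baseURI, rel);
         // First line shows the base as parsed by URI, second the raw argument.
         System.out.println("\"" + baseURI + "\" + \"" + rel + "\" = \""
                            + uri + "\"");
         System.out.println("\"" + base + "\" + \"" + rel + "\" = \""
                            + uri + "\"");
     }
}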

[2]: Test case:
public class TestURI extends TestNoHost {

     ...

     public void testURIConstructorKeepsFrag() throws Exception {
	String origURIWithFrag = "http://www.foo.com/foo#frag";
	URI uri = new URI(origURIWithFrag);
	String asString = uri.toString();
	assertTrue("Doesn't drop fragment"
		   + " expected: \"" + origURIWithFrag + "\""
		   + " actual: \"" + asString + "\"",
		   origURIWithFrag.equals(asString));
     }

     ...
}

-- 
Mike

Re: Relative URIs strike again

Posted by Michael Becke <be...@u.washington.edu>.

These cases are partially handled as of now.

<a href="">""</a>
Does not work, but should.  This falls into the case of abnormal URIs 
according to the RFC.  I think Armando was working on a fix for this.

<a href="#">"#"</a>
<a href="#anchor">"#anchor"</a>
Both of these cases work correctly in the most recent code.

Mike

Mike Moran wrote:
> Mike Moran wrote:
> [ ... ]
> 
> I think a mime gobbler woke up and fed or I erred, so here is the file 
> inline:
> 
> <html>
> <head>
> </head>
> 
> <body>
> <a href="">""</a>
> <a href="#">"#"</a>
> <a href="#anchor">"#anchor"</a>
> </body>
> </html>

Re: Relative URIs strike again

Posted by Mike Moran <mi...@mac.com>.

Mike Moran wrote:
[ ... ]

I think a mime gobbler woke up and fed or I erred, so here is the file 
inline:

<html>
<head>
</head>

<body>
<a href="">""</a>
<a href="#">"#"</a>
<a href="#anchor">"#anchor"</a>
</body>
</html>