Posted to dev@hc.apache.org by "Jeff Dalton (JIRA)" <ji...@apache.org> on 2007/08/03 17:59:53 UTC

[jira] Created: (HTTPCLIENT-679) URI Absolutization does not follow browser behavior

URI Absolutization does not follow browser behavior
---------------------------------------------------

                 Key: HTTPCLIENT-679
                 URL: https://issues.apache.org/jira/browse/HTTPCLIENT-679
             Project: HttpComponents HttpClient
          Issue Type: Bug
          Components: HttpClient
    Affects Versions: 3.1 RC1
         Environment: HttpClient 3.1 RC1, 
JDK 1.6.0
Ubuntu 7.04
            Reporter: Jeff Dalton


This was encountered using Heritrix to crawl a prominent website.

The URI resulting from the HttpClient URI constructor (base, relative) does not follow browser behavior:
URI newUrl = new URI(new URI("http://www.theirwebsite.com/browse/results?type=browse&att=1"), "?sort=0&offset=11&pageSize=10");

Results in newUrl:
http://www.theirwebsite.com/browse/?sort=0&offset=11&pageSize=10

The desired behavior based on Firefox and IE should be:
http://www.theirwebsite.com/browse/results?sort=0&offset=11&pageSize=10

These browsers treat the question mark much like a directory separator: a relative reference that consists only of a query keeps the base path, so no file needs to be specified before the query.

HttpClient's current behavior does not correspond to current browser behavior and leads to an inability to crawl certain websites if HttpClient's URI class is used.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: httpcomponents-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: httpcomponents-dev-help@jakarta.apache.org


[jira] Updated: (HTTPCLIENT-679) URI Absolutization does not follow browser behavior

Posted by "Roland Weber (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HTTPCLIENT-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roland Weber updated HTTPCLIENT-679:
------------------------------------

    Fix Version/s: 3.1 Final


[jira] Commented: (HTTPCLIENT-679) URI Absolutization does not follow browser behavior

Posted by "Ortwin Glück (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HTTPCLIENT-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518364 ] 

Ortwin Glück commented on HTTPCLIENT-679:
-----------------------------------------

Roland,

In my opinion this is a corner case that has hardly any relevance in the real world. I doubt that anybody other than Jeff is affected by the change. It is fine with me to check this in.

Ortwin


[jira] Commented: (HTTPCLIENT-679) URI Absolutization does not follow browser behavior

Posted by "Gordon Mohr (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HTTPCLIENT-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517586 ] 

Gordon Mohr commented on HTTPCLIENT-679:
----------------------------------------

Notably, the browsers are following RFC3986. Taking an example from RFC3986 section 5.4.1 ("Normal Examples"):

URI uri = new URI(new URI("http://a/b/c/d;p?q"), "?y");
uri.toString(); // is "http://a/b/c/?y"; by RFC3986 should be "http://a/b/c/d;p?y"
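To make the RFC 3986 rule concrete, here is a minimal sketch of section 5.2.2 for the one case discussed in this issue: a reference that consists only of a query (plus an optional fragment). The class and method names are hypothetical and are not part of HttpClient; the sketch works on plain strings and assumes the base is an absolute URI.

```java
final class QueryOnlyResolver {

    /**
     * Resolves a reference of the form "?query" against an absolute base URI.
     * Per RFC 3986 section 5.2.2, an empty reference path means the base path
     * is kept, and a defined reference query replaces the base query.
     */
    static String resolveQueryOnly(String base, String ref) {
        if (!ref.startsWith("?")) {
            throw new IllegalArgumentException("expected a query-only reference: " + ref);
        }
        // The fragment of the base never carries over into the result.
        int hash = base.indexOf('#');
        String noFragment = hash >= 0 ? base.substring(0, hash) : base;
        // Keep scheme, authority and path; drop the base query.
        int q = noFragment.indexOf('?');
        String upToPath = q >= 0 ? noFragment.substring(0, q) : noFragment;
        // The reference supplies the new query (and its own fragment, if any).
        return upToPath + ref;
    }

    public static void main(String[] args) {
        // RFC 3986 section 5.4.1 example
        System.out.println(resolveQueryOnly("http://a/b/c/d;p?q", "?y"));
        // prints "http://a/b/c/d;p?y"
    }
}
```

Applied to the URL from the original report, the same method keeps the "results" segment, which is the result the browsers produce.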





[jira] Updated: (HTTPCLIENT-679) URI Absolutization does not follow browser behavior

Posted by "Jeff Dalton (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HTTPCLIENT-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Dalton updated HTTPCLIENT-679:
-----------------------------------

    Attachment: uri_fix.patch

Wow, that class is a steaming pile.  It's a miracle it works as well as it does.

Regardless, for now it is easier to fix HttpClient than to change Heritrix over to use the java.net class (which has its own bugs to be worked around).

Here is a patch for the URI class that changes it to follow the RFC behavior in this case.  I have updated the URI class and the corresponding test to follow the RFC.  

Oleg, it would be great if you could review this.

Roland, I firmly believe that it is better to fix a major bug that violates a basic case in the RFC than to not fix it and release a final version with a major flaw.  Feel free to add a switch to make it backwards compatible, if this is a hard requirement.
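For illustration only, a backwards-compatibility switch of the kind mentioned above could look roughly like the sketch below. This is not the content of uri_fix.patch: the class, the method, and the system property name are all hypothetical, and the sketch assumes an absolute base URI with a non-empty path.

```java
final class QueryOnlyResolutionSwitch {

    // Hypothetical property name chosen for this sketch; it is not an HttpClient setting.
    static final String LEGACY_PROPERTY = "example.uri.legacyQueryResolution";

    /**
     * Resolves a "?query" reference against an absolute base URI.
     * legacy == false: keep the full base path (RFC 3986 behavior).
     * legacy == true:  drop the last path segment (the behavior reported in this issue).
     */
    static String resolveQueryOnly(String base, String ref, boolean legacy) {
        if (!ref.startsWith("?")) {
            throw new IllegalArgumentException("expected a query-only reference: " + ref);
        }
        // Strip the base fragment, then the base query.
        int hash = base.indexOf('#');
        String noFragment = hash >= 0 ? base.substring(0, hash) : base;
        int q = noFragment.indexOf('?');
        String upToPath = q >= 0 ? noFragment.substring(0, q) : noFragment;
        if (legacy) {
            // Assumes the base has a non-empty path, so the last '/' belongs to the path.
            upToPath = upToPath.substring(0, upToPath.lastIndexOf('/') + 1);
        }
        return upToPath + ref;
    }

    public static void main(String[] args) {
        // Run with -Dexample.uri.legacyQueryResolution=true to select the old behavior.
        boolean legacy = Boolean.getBoolean(LEGACY_PROPERTY);
        String base = "http://www.theirwebsite.com/browse/results?type=browse&att=1";
        System.out.println(resolveQueryOnly(base, "?sort=0&offset=11&pageSize=10", legacy));
    }
}
```

Defaulting the flag to the RFC behavior means only applications that depend on the old result would need to set the property.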


[jira] Resolved: (HTTPCLIENT-679) URI Absolutization does not follow browser behavior

Posted by "Oleg Kalnichevski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HTTPCLIENT-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleg Kalnichevski resolved HTTPCLIENT-679.
------------------------------------------

    Resolution: Fixed

Patch checked in

Oleg


[jira] Commented: (HTTPCLIENT-679) URI Absolutization does not follow browser behavior

Posted by "Roland Weber (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HTTPCLIENT-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517960 ] 

Roland Weber commented on HTTPCLIENT-679:
-----------------------------------------

Actually, the suggested change has the potential to break existing applications that rely on the current behavior. We can't introduce such changes between RC1 and final unless there is an easy way to switch back to the current behavior.

cheers,
   Roland



[jira] Resolved: (HTTPCLIENT-679) URI Absolutization does not follow browser behavior

Posted by "Oleg Kalnichevski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HTTPCLIENT-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleg Kalnichevski resolved HTTPCLIENT-679.
------------------------------------------

    Resolution: Won't Fix

Jeff, Gordon,

The URI class in HttpClient 3.x is a complete mess that none of the existing committers would touch even with a barge pole. This class has been replaced with the standard java.net.URI class in HttpClient 4.0. If you are prepared to contribute a fix for the problem, I'll happily review it and check it in to the repository, but I seriously doubt any of us would be willing to invest any time in fixing the old URI code in HttpClient 3.x.

Oleg


[jira] Reopened: (HTTPCLIENT-679) URI Absolutization does not follow browser behavior

Posted by "Oleg Kalnichevski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HTTPCLIENT-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleg Kalnichevski reopened HTTPCLIENT-679:
------------------------------------------


The fix looks reasonable to me. If I hear no complaints, I'll check it in later this week.

Oleg
