Posted to dev@hc.apache.org by "Jeff Dalton (JIRA)" <ji...@apache.org> on 2007/08/08 05:29:59 UTC

[jira] Updated: (HTTPCLIENT-679) URI Absolutization does not follow browser behavior

     [ https://issues.apache.org/jira/browse/HTTPCLIENT-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Dalton updated HTTPCLIENT-679:
-----------------------------------

    Attachment: uri_fix.patch

Wow, that class is a steaming pile.  It's a miracle it works as well as it does.

Regardless, for now it is easier to fix HttpClient than to switch Heritrix over to the java.net class (which has its own bugs to work around).

Here is a patch for the URI class that makes it follow the RFC behavior in this case.  The corresponding test is updated to match.

Oleg, it would be great if you could review this.

Roland, I firmly believe that it is better to fix a major bug that violates a basic case in the RFC than to not fix it and release a final version with a major flaw.  Feel free to add a switch to make it backwards compatible, if this is a hard requirement.

> URI Absolutization does not follow browser behavior
> ---------------------------------------------------
>
>                 Key: HTTPCLIENT-679
>                 URL: https://issues.apache.org/jira/browse/HTTPCLIENT-679
>             Project: HttpComponents HttpClient
>          Issue Type: Bug
>          Components: HttpClient
>    Affects Versions: 3.1 RC1
>         Environment: HttpClient 3.1 RC1, 
> JDK 1.6.0
> Ubuntu 7.04
>            Reporter: Jeff Dalton
>         Attachments: uri_fix.patch
>
>
> This was encountered using Heritrix to crawl a prominent website.
> The URI resulting from the HttpClient URI constructor (base, relative) does not follow browser behavior:
> URI newUrl = new URI(new URI("http://www.theirwebsite.com/browse/results?type=browse&att=1"), "?sort=0&offset=11&pageSize=10");
> Results in newUrl:
> http://www.theirwebsite.com/browse/?sort=0&offset=11&pageSize=10
> The desired behavior based on Firefox and IE should be:
> http://www.theirwebsite.com/browse/results?sort=0&offset=11&pageSize=10
> These browsers treat the question mark similarly to a directory separator and do not require a file to be specified before the query.
> HttpClient's current behavior does not correspond to current browser behavior and leads to an inability to crawl certain websites if HttpClient's URI class is used.
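
For illustration, here is a minimal sketch of the resolution rule the report asks for, which is also what RFC 3986 (section 5.2.2) specifies for a reference with an empty path and a query: keep the base path whole and replace only the query. This is not the attached patch. The class and method names are hypothetical, java.net.URI is used only to split the base into components (not to resolve), and the base is assumed to be a hierarchical URI with an authority.

```java
import java.net.URI;

// Hypothetical helper, not part of HttpClient or of uri_fix.patch.
final class QueryOnlyResolve {

    // Resolves a reference of the form "?query" (optionally followed by "#fragment")
    // against a base URI by keeping the base scheme, authority and full path.
    static String resolveQueryOnly(String base, String reference) {
        if (!reference.startsWith("?")) {
            throw new IllegalArgumentException("expected a query-only reference: " + reference);
        }
        // Assumption: base is hierarchical and has an authority (e.g. http://host/path).
        URI b = URI.create(base);
        // The base query and fragment are dropped; the reference supplies its own.
        return b.getScheme() + "://" + b.getRawAuthority() + b.getRawPath() + reference;
    }

    public static void main(String[] args) {
        String base = "http://www.theirwebsite.com/browse/results?type=browse&att=1";
        String ref = "?sort=0&offset=11&pageSize=10";
        // Prints http://www.theirwebsite.com/browse/results?sort=0&offset=11&pageSize=10
        System.out.println(resolveQueryOnly(base, ref));
    }
}
```

The last path segment ("results") survives because the reference has no path of its own, so there is nothing to merge with the base path.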


