Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2009/09/04 21:00:58 UTC

[jira] Created: (NUTCH-751) Upgrade version of HttpClient

Upgrade version of HttpClient 
------------------------------

                 Key: NUTCH-751
                 URL: https://issues.apache.org/jira/browse/NUTCH-751
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
            Reporter: Julien Nioche


The existing version of commons http-client (3.01) should be replaced with the latest version from http://hc.apache.org/.
Currently the only way to use the https protocol is to enable http-client. Version 3.01 is buggy and causes many issues that have been reported before. The new version has apparently been redesigned and should fix them. The old v3.01 is too unstable to be used on a large scale.
 
I will try to send a patch in the next couple of weeks but would love to hear your thoughts on this.

J.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-751) Upgrade version of HttpClient

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753175#action_12753175 ] 

Julien Nioche commented on NUTCH-751:
-------------------------------------

Thanks for the pointer, Ken; that will be very useful when I start looking into this.


[jira] Commented: (NUTCH-751) Upgrade version of HttpClient

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751893#action_12751893 ] 

Andrzej Bialecki  commented on NUTCH-751:
-----------------------------------------

In general, if a new version of a third-party package doesn't cause regressions, then we should upgrade.


[jira] Resolved: (NUTCH-751) Upgrade version of HttpClient

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-751.
---------------------------------

    Resolution: Later

The changes in the underlying API are quite substantial, and this would need a fair amount of work. Maybe this could be done as part of crawler-commons? In the meantime, I'll just mark it as 'Later'.


[jira] Commented: (NUTCH-751) Upgrade version of HttpClient

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798890#action_12798890 ] 

Ken Krugler commented on NUTCH-751:
-----------------------------------

I agree that this should be in crawler-commons. E.g. I've recently made changes to avoid synchronization bottlenecks with HttpClient 4.0, and identified a few places in HC where things should be improved.

Though I'm concerned that the level of customization each crawler wants could result in a pretty ugly ball of code. For example, in Bixo I'm looking at how to use a streaming disk buffer for reads, to avoid OOM errors when many threads are each handling big responses. How would that get implemented in a way that's friendly to Nutch, Droids & Heritrix?
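
To make the streaming disk buffer idea concrete, here is a rough sketch using only the Java standard library. It is not Bixo's implementation; the class name SpillingBodyBuffer, the memoryLimit parameter, and the temp-file prefix are all made up for illustration. The body is read in small chunks and kept on the heap until it exceeds the limit, after which it is moved to a temp file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: holds a response body in memory, or on disk once it grows too large. */
final class SpillingBodyBuffer implements AutoCloseable {
    private final byte[] inMemory; // non-null only when the body fit under the limit
    private final Path spillFile;  // non-null only when the body was spilled to disk
    private final long length;

    private SpillingBodyBuffer(byte[] inMemory, Path spillFile, long length) {
        this.inMemory = inMemory;
        this.spillFile = spillFile;
        this.length = length;
    }

    /** Drains the stream; at most about memoryLimit bytes plus one chunk stay on the heap. */
    static SpillingBodyBuffer read(InputStream in, int memoryLimit) throws IOException {
        ByteArrayOutputStream memory = new ByteArrayOutputStream();
        Path spillFile = null;
        OutputStream spillOut = null;
        long length = 0;
        byte[] chunk = new byte[8192];
        boolean ok = false;
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                if (spillOut == null && length + n > memoryLimit) {
                    // Limit crossed: copy what we have to a temp file and drop the heap copy.
                    spillFile = Files.createTempFile("fetch-body-", ".tmp");
                    spillOut = Files.newOutputStream(spillFile);
                    memory.writeTo(spillOut);
                    memory = null;
                }
                if (spillOut != null) {
                    spillOut.write(chunk, 0, n);
                } else {
                    memory.write(chunk, 0, n);
                }
                length += n;
            }
            ok = true;
        } finally {
            if (spillOut != null) {
                spillOut.close();
            }
            if (!ok && spillFile != null) {
                Files.deleteIfExists(spillFile); // do not leak temp files on failure
            }
        }
        byte[] bytes = (memory == null) ? null : memory.toByteArray();
        return new SpillingBodyBuffer(bytes, spillFile, length);
    }

    boolean isSpilled() {
        return spillFile != null;
    }

    long length() {
        return length;
    }

    /** Opens a fresh stream over the buffered body, from disk or from memory. */
    InputStream openStream() throws IOException {
        if (spillFile != null) {
            return Files.newInputStream(spillFile);
        }
        return new ByteArrayInputStream(inMemory);
    }

    /** Deletes the temp file, if one was created. */
    @Override
    public void close() throws IOException {
        if (spillFile != null) {
            Files.deleteIfExists(spillFile);
        }
    }
}
```

A fetcher thread would call SpillingBodyBuffer.read(responseStream, limit) inside a try-with-resources block and hand openStream() to the parser. How each crawler would want to pick the limit, or where it keeps temp files, is exactly the kind of per-crawler customization in question.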

If we could define some least-common-denominator API, that would be a good starting point. E.g. here is the set of config values, here is the set of parameters required when making a request, and here's the format of the response from a request.
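
As a strawman for what such a least-common-denominator API might look like, here is a minimal Java sketch. None of these types exist in Nutch, Bixo, Droids, Heritrix or crawler-commons; FetcherConfig, FetchRequest, FetchResult, PageFetcher and the fields they carry are assumptions chosen only to mirror the three pieces listed above (config values, request parameters, response format). CannedFetcher is a stand-in that does no network I/O.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical "set of config values" shared by every fetcher implementation. */
final class FetcherConfig {
    final String userAgent;
    final int timeoutMillis;
    final int maxContentBytes;

    FetcherConfig(String userAgent, int timeoutMillis, int maxContentBytes) {
        this.userAgent = userAgent;
        this.timeoutMillis = timeoutMillis;
        this.maxContentBytes = maxContentBytes;
    }
}

/** Hypothetical "parameters required when making a request". */
final class FetchRequest {
    final String url;
    final Map<String, String> headers;

    FetchRequest(String url, Map<String, String> headers) {
        this.url = url;
        this.headers = Collections.unmodifiableMap(new LinkedHashMap<>(headers));
    }
}

/** Hypothetical "format of the response from a request". */
final class FetchResult {
    final int statusCode;
    final Map<String, String> headers;
    final byte[] content;

    FetchResult(int statusCode, Map<String, String> headers, byte[] content) {
        this.statusCode = statusCode;
        this.headers = Collections.unmodifiableMap(new LinkedHashMap<>(headers));
        this.content = content;
    }
}

/** The only type a crawler would need to depend on. */
interface PageFetcher {
    FetchResult fetch(FetchRequest request) throws IOException;
}

/** Stand-in implementation with no network access, used here only to exercise the types. */
final class CannedFetcher implements PageFetcher {
    private final FetcherConfig config;

    CannedFetcher(FetcherConfig config) {
        this.config = config;
    }

    @Override
    public FetchResult fetch(FetchRequest request) {
        byte[] body = ("fetched " + request.url + " as " + config.userAgent)
                .getBytes(StandardCharsets.UTF_8);
        // Honor the shared size cap by truncating the body.
        int len = Math.min(body.length, config.maxContentBytes);
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Content-Length", Integer.toString(len));
        return new FetchResult(200, headers, Arrays.copyOf(body, len));
    }
}
```

An HttpClient 4.0 backed implementation would sit behind PageFetcher in the same way. The sketch returns the body as a byte array for brevity, which is precisely where a crawler that needs streaming or disk buffering would push back on the design.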


[jira] Commented: (NUTCH-751) Upgrade version of HttpClient

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753069#action_12753069 ] 

Ken Krugler commented on NUTCH-751:
-----------------------------------

I'm using HttpClient 4.0 in Bixo, and I agree that Nutch should upgrade.

But the API has been changed significantly, as I'm sure Julien has seen. Lots of improvements, but this will be a non-trivial patch.

There was a recent (Sept 2nd) post on the HttpClient list by Gerald Turner, and a response by Oleg, that contained a lot of useful info about migrating from 3.1 to 4.0.
