Posted to dev@nutch.apache.org by "Susam Pal (JIRA)" <ji...@apache.org> on 2007/09/18 20:13:43 UTC

[jira] Created: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication
--------------------------------------------------------------------------

                 Key: NUTCH-557
                 URL: https://issues.apache.org/jira/browse/NUTCH-557
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 1.0.0
            Reporter: Susam Pal


'protocol-http11' is a protocol plugin that retrieves documents over HTTP 1.0, HTTP 1.1 and HTTPS, optionally using the Basic, Digest or NTLM authentication schemes for web servers as well as proxy servers.

The user guide and other information can be found here:- [http://wiki.apache.org/nutch/protocol-http11]

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

Posted by "Susam Pal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529528 ] 

Susam Pal commented on NUTCH-557:
---------------------------------

Thank you, Doğacan and Andrzej, for your comments. I started developing it in a fresh directory, which is why it ended up as a new plugin. I can turn it into a patch for protocol-httpclient. I have two questions.

1. In my plugin, the structure of the code is a little different from that of protocol-httpclient. This means that if I simply replace Http.java and HttpResponse.java of protocol-httpclient with mine, the diff would be huge. Is that acceptable, or do you prefer that I carefully merge Http.java and HttpResponse.java of protocol-http11 into protocol-httpclient, so that the diff makes sense?

2. I don't see these files of protocol-httpclient being used anywhere:- (i) DummySSLProtocolSocketFactory.java (ii) DummyX509TrustManager.java (iii) HttpAuthenticationException.java (iv) HttpAuthenticationFactory.java (v) HttpAuthentication.java (vi) HttpBasicAuthentication.java. Moreover, my plugin includes basic authentication and HTTPS. So what should be done with these unused files?



[jira] Closed: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

Posted by "Susam Pal (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Susam Pal closed NUTCH-557.
---------------------------

    Resolution: Won't Fix

As per the discussion, 'protocol-http11' has been turned into a patch for 'protocol-httpclient'. This patch is available at NUTCH-559. <https://issues.apache.org/jira/browse/NUTCH-559>



[jira] Commented: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529394 ] 

Doğacan Güney commented on NUTCH-557:
-------------------------------------

Hi Susam,

This looks useful, but I wonder: why not make this a patch against protocol-httpclient instead of another plugin? AFAICS, they are more similar than they are different.

Also, you don't seem to stop reading the HTTP response after Content-Length bytes. Btw, it is probably better to dump the fetch trace to LOG.trace or LOG.debug, not LOG.info.
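
The Content-Length point can be illustrated with a minimal sketch. This is plain Java and is not taken from the patch or from Nutch: the class and method names are hypothetical, it assumes the Content-Length header has already been parsed into an int, and it ignores chunked transfer encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not code from protocol-http11 or protocol-httpclient.
final class BoundedBodyReader {

    // Reads at most contentLength bytes of body from the stream and then stops,
    // even if more bytes are available on the connection.
    static byte[] readBody(InputStream in, int contentLength) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int remaining = contentLength;
        while (remaining > 0) {
            // Never ask for more than what is still owed by the declared length.
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n == -1) {
                break; // stream ended early; return what arrived
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The stream holds 11 bytes, but the declared length is only 5.
        InputStream in = new ByteArrayInputStream(
            "hello world".getBytes(StandardCharsets.US_ASCII));
        byte[] body = readBody(in, 5);
        System.out.println(new String(body, StandardCharsets.US_ASCII)); // prints "hello"
    }
}
```

Bounding each read by the remaining count matters most on connections that stay open, where a read past the declared length would block or consume the start of the next response.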



[jira] Commented: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529522 ] 

Andrzej Bialecki  commented on NUTCH-557:
-----------------------------------------

I agree with Dogacan - I don't see why this plugin shouldn't be turned into a patch for protocol-httpclient, simply adding the options that you added to your plugin. Other than these options, the two plugins are identical.

Regarding the benefits of using http/1.1: the main difference, from the Nutch point of view, would be the support for keep-alives, i.e. the ability to send multiple requests over the same TCP connection. However, in practice this functionality is only rarely useful in our case, because it requires making many requests to the same host, whereas Nutch shuffles the hosts in order to provide a higher throughput while still honoring the politeness settings. This means that with a large fetchlist containing many hosts, consecutive requests almost never go to the same host. This in turn means that in order to benefit from keep-alives we would either have to keep around massive numbers of open connections (infeasible), or drop connections between requests ... which is what http/1.0 does :)
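
The effect of host shuffling on keep-alives can be shown with a toy model. The sketch below is not Nutch code: it assumes a fetcher that keeps only the most recently used connection open, and counts how many requests could reuse it. The class name, method name and example hosts are made up for illustration.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Toy model, not part of Nutch: one kept-alive connection per fetcher.
final class KeepAliveReuse {

    // Counts requests that go to the same host as the request just before them,
    // i.e. requests that could reuse a single kept-alive connection.
    static int countReusable(List<String> urls) {
        int reusable = 0;
        String previousHost = null;
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host != null && host.equals(previousHost)) {
                reusable++;
            }
            previousHost = host;
        }
        return reusable;
    }

    public static void main(String[] args) {
        // Requests grouped by host: every second request can reuse the connection.
        List<String> grouped = Arrays.asList(
            "http://a.example/1", "http://a.example/2",
            "http://b.example/1", "http://b.example/2",
            "http://c.example/1", "http://c.example/2");
        // Requests interleaved across hosts, as a shuffled fetchlist tends to be.
        List<String> interleaved = Arrays.asList(
            "http://a.example/1", "http://b.example/1", "http://c.example/1",
            "http://a.example/2", "http://b.example/2", "http://c.example/2");
        System.out.println(countReusable(grouped));     // prints 3
        System.out.println(countReusable(interleaved)); // prints 0
    }
}
```

In this model the interleaved order never reuses a connection, which matches the argument above: without holding many connections open at once, keep-alives bring no benefit over http/1.0 behaviour.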



[jira] Commented: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

Posted by "Susam Pal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528854 ] 

Susam Pal commented on NUTCH-557:
---------------------------------

No, there isn't any significant difference in performance. Here is the CPU time consumed by a Nutch crawl over 15 attempts (5 per plugin).

Each line gives, in order: serial no., protocol-http11, protocol-http, protocol-httpclient. The values are in seconds.

1) 17.6, 17.4, 17.4
2) 17.4, 17.2, 17.5
3) 23.6, 23.7, 23.3
4) 31.9, 33.7, 31.6
5) 51.1, 51.2, 52.1
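
To make the comparison easier to read, the five timings per plugin can be averaged. The sketch below only does arithmetic on the figures listed above; the class and method names are hypothetical, and five runs per plugin are too few to support any statistical conclusion.

```java
import java.util.Locale;

// Hypothetical helper that averages the CPU times reported in this comment.
final class CrawlTimings {

    // Arithmetic mean of the given values.
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Columns of the list above, one array per plugin, in seconds.
        double[] http11 = {17.6, 17.4, 23.6, 31.9, 51.1};
        double[] http = {17.4, 17.2, 23.7, 33.7, 51.2};
        double[] httpclient = {17.4, 17.5, 23.3, 31.6, 52.1};

        System.out.println(String.format(Locale.ROOT, "protocol-http11:     %.2f s", mean(http11)));
        System.out.println(String.format(Locale.ROOT, "protocol-http:       %.2f s", mean(http)));
        System.out.println(String.format(Locale.ROOT, "protocol-httpclient: %.2f s", mean(httpclient)));
    }
}
```

The means come out at about 28.32 s, 28.64 s and 28.38 s respectively, a spread of roughly 1%, which is consistent with the statement that there is no significant difference.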




[jira] Updated: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

Posted by "Susam Pal (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Susam Pal updated NUTCH-557:
----------------------------

    Priority: Minor  (was: Major)



[jira] Updated: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

Posted by "Susam Pal (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Susam Pal updated NUTCH-557:
----------------------------

    Attachment: protocol-http11v0.1.patch

I have generated this patch against Nutch trunk.

To apply:-

patch -p0 < protocol-http11v0.1.patch
ant



[jira] Commented: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

Posted by "Susam Pal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529530 ] 

Susam Pal commented on NUTCH-557:
---------------------------------

Point no. 2 of my previous comment is incorrect. The SSL-related files are being used, whereas the authentication-related files are not. What do you suggest for the unused files?



[jira] Commented: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

Posted by "Emmanuel Joke (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528729 ] 

Emmanuel Joke commented on NUTCH-557:
-------------------------------------

Did you notice any difference in terms of performance, either an improvement or a degradation?
