Posted to dev@nutch.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2005/10/12 00:50:04 UTC

[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331847 ] 

Doug Cutting commented on NUTCH-109:
------------------------------------

Is your HTTP client polite?  Does it keep only a single connection open to the server at a time, and does it pause for fetcher.server.delay between requests?  It looks as though you are permitting three simultaneous requests, and I can see no delays.

How did you configure protocol-http and protocol-httpclient?  One can configure these to use multiple connections per server by increasing fetcher.threads.per.host.  By default they will only make a single request at a time.  One can also configure these to not delay between requests by setting fetcher.server.delay to zero.  Such settings are not considered polite, but they will substantially improve fetcher performance.
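
To make the two knobs concrete, here is a minimal sketch of a per-host politeness gate. It is not Nutch's fetcher code: the class HostPolitenessGate, its methods, and the explicit clock argument are all hypothetical, chosen for illustration. It assumes a limit on in-flight requests per host (the role fetcher.threads.per.host plays) and a pause after each request completes (the role fetcher.server.delay plays). It also uses Java 8 features that J2SE 1.4 lacks.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; hypothetical class, not part of Nutch. */
public class HostPolitenessGate {

    /** Per-host bookkeeping. */
    private static final class HostState {
        int active;          // requests currently in flight to this host
        long nextAllowedAt;  // earliest time (ms) the next request may start
    }

    private final int threadsPerHost;      // assumed analogue of fetcher.threads.per.host
    private final long serverDelayMillis;  // assumed analogue of fetcher.server.delay, in ms
    private final Map<String, HostState> hosts = new HashMap<>();

    public HostPolitenessGate(int threadsPerHost, long serverDelayMillis) {
        this.threadsPerHost = threadsPerHost;
        this.serverDelayMillis = serverDelayMillis;
    }

    /**
     * Returns true and reserves a slot if a request to the host may start now.
     * The caller passes the current time so the logic stays deterministic.
     */
    public synchronized boolean tryAcquire(String host, long nowMillis) {
        HostState state = hosts.computeIfAbsent(host, h -> new HostState());
        if (state.active >= threadsPerHost || nowMillis < state.nextAllowedAt) {
            return false;
        }
        state.active++;
        return true;
    }

    /** Marks a request as finished and starts the delay window for the host. */
    public synchronized void release(String host, long nowMillis) {
        HostState state = hosts.get(host);
        if (state == null || state.active == 0) {
            throw new IllegalStateException("release without acquire: " + host);
        }
        state.active--;
        state.nextAllowedAt = nowMillis + serverDelayMillis;
    }
}
```

With threadsPerHost set to 1 and a non-zero delay, the gate admits one request per host at a time and enforces a pause after each. Raising the first value or zeroing the second gives the faster but impolite behaviour described above. The delay value used in any given test is arbitrary and is not a claim about Nutch's defaults.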


> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> -----------------------------------------------------------------------
>
>          Key: NUTCH-109
>          URL: http://issues.apache.org/jira/browse/NUTCH-109
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7, 0.8-dev, 0.6, 0.7.1
>  Environment: Nutch: Windows XP, J2SE 1.4.2_09
> Web Server: Suse Linux, Apache HTTPD, apache2-worker,  v. 2.0.53
>     Reporter: Fuad Efendi
>  Attachments: protocol-httpclient-innovation-0.1.0.zip
>
> 1. TCP connection costs a lot, not only for Nutch and end-point web servers, but also for intermediary network equipment 
> 2. Web Server creates Client thread and hopes that Nutch really uses HTTP/1.1, or at least Nutch sends "Connection: close" before closing in JVM "Socket.close()" ...
> I need to perform very objective tests, probably 2-3 days; the new plugin crawled/parsed 23,000 pages in 1,321 seconds; it seems that the existing http-plugin needs a few days...
> I am using separate network segment with Windows XP (Nutch), and Suse Linux (Apache HTTPD + 120,000 pages)
> Please find attached new plugin based on http://www.innovation.ch/java/HTTPClient/
> Please note: 
> Class HttpFactory contains cache of HTTPConnection objects; each object run each thread; each object is absolutely thread-safe, so we can send multiple GET requests using single instance:
>    private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...
