Posted to dev@nutch.apache.org by "Fuad Efendi (JIRA)" <ji...@apache.org> on 2005/10/12 00:35:09 UTC

[jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

     [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]

Fuad Efendi updated NUTCH-109:
------------------------------

    Summary: Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation  (was: Nutch - Fetcher - HTTP - Performance Testing & Tuning)

I ran performance tests against a default Apache HTTPD web server installation serving 120,000 mirrored pages (I used Teleport Ultra to mirror the HTML pages from www.apache.org, which took about 10 hours).

Everything ran in a separate LAN: Windows XP (client with Nutch 0.7.1) and Suse Linux 9.3 (server with Apache).

I measured a crawl of 21,000 pages (Depth=6, Threads=20); crawling all 120,000 pages looks like it would take a few days:

Protocol-HTTPClient-Innovation: 
1,321,470 milliseconds

Protocol-HTTP: 
26,946,076 milliseconds

Protocol-HttpClient: 
27,062,854 milliseconds
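
For a rough sense of scale, the timings above can be converted into minutes, pages per second, and a speedup factor. This is only an illustrative calculation: the class and method names are made up for this note, and it assumes every run fetched the same 21,000 pages.

```java
public class CrawlTimings {
    // Pages fetched in each test run, as reported above (assumed equal for all runs).
    public static final int PAGES = 21_000;

    /** Pages per second for a run that took elapsedMillis milliseconds. */
    public static double pagesPerSecond(long elapsedMillis) {
        return PAGES / (elapsedMillis / 1000.0);
    }

    /** How many times longer the slow run took than the fast run. */
    public static double speedup(long slowMillis, long fastMillis) {
        return (double) slowMillis / fastMillis;
    }

    private static void report(String name, long millis) {
        System.out.printf("%-32s %8.1f min  %6.2f pages/s%n",
                name, millis / 60_000.0, pagesPerSecond(millis));
    }

    public static void main(String[] args) {
        long innovation = 1_321_470L;   // Protocol-HTTPClient-Innovation
        long http = 26_946_076L;        // Protocol-HTTP
        long httpClient = 27_062_854L;  // Protocol-HttpClient

        report("Protocol-HTTPClient-Innovation", innovation);
        report("Protocol-HTTP", http);
        report("Protocol-HttpClient", httpClient);

        System.out.printf("speedup vs Protocol-HTTP:       %.1fx%n", speedup(http, innovation));
        System.out.printf("speedup vs Protocol-HttpClient: %.1fx%n", speedup(httpClient, innovation));
    }
}
```

With these inputs the new plugin comes out roughly 20 times faster than either existing plugin.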


P.S.
Please note, the Protocol-HTTPClient-Innovation plugin is only a basic version, v.0.1.0.
HttpFactory is still growing and contains a cache (3 TCP connections per host).
http://www.innovation.ch/java/HTTPClient/ is very old but _production_ level... the style of the source code may seem dated... you may need to rename the identifier "enum" to "enumeration" in the downloaded source files in order to compile it ("enum" is a reserved word from Java 5 on) :)))

A very popular load-generating tool, The Grinder, is based on HTTPClient (Innovation):
http://grinder.sourceforge.net/
http://www.innovation.ch/java/HTTPClient/


> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> -----------------------------------------------------------------------
>
>          Key: NUTCH-109
>          URL: http://issues.apache.org/jira/browse/NUTCH-109
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7, 0.8-dev, 0.6, 0.7.1
>  Environment: Nutch: Windows XP, J2SE 1.4.2_09
> Web Server: Suse Linux, Apache HTTPD, apache2-worker,  v. 2.0.53
>     Reporter: Fuad Efendi
>  Attachments: protocol-httpclient-innovation-0.1.0.zip
>
> 1. A TCP connection costs a lot, not only for Nutch and the end-point web servers, but also for intermediary network equipment
> 2. The web server creates a client thread and expects that Nutch really uses HTTP/1.1, or at least that Nutch sends "Connection: close" before calling "Socket.close()" in the JVM ...
> I need to perform very objective tests, probably 2-3 days; the new plugin crawled/parsed 23,000 pages in 1,321 seconds; it seems that the existing http-plugin needs a few days...
> I am using a separate network segment with Windows XP (Nutch) and Suse Linux (Apache HTTPD + 120,000 pages)
> Please find attached the new plugin based on http://www.innovation.ch/java/HTTPClient/
> Please note:
> Class HttpFactory contains a cache of HTTPConnection objects; each object is used by multiple threads; each object is absolutely thread-safe, so we can send multiple GET requests using a single instance:
>    private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


RE: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by Fuad Efendi <fu...@efendi.ca>.
I have to run another test... At least we know that the problem is in
the network layer...
I believe it is not only HTTP_1_1: establishing a TCP connection also takes
a long time (including intermediary equipment such as routers, firewalls,
per-IP load balancers, ...)

In my sample, HttpFactory caches TCP connections (3 sockets per host), and
HTTPClient automatically re-establishes HTTP keep-alive every 60 seconds;
HttpClient/Apache probably also has this functionality, which we don't use
yet...
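
To make the caching idea concrete, here is a minimal sketch of a per-host cache that hands out a fixed number of connections per host. It is not the code from the attached plugin: the class names PerHostCache and Conn are hypothetical, Conn is only a stand-in for a thread-safe keep-alive connection object, and the round-robin selection is an assumption (the thread does not say how a connection is picked).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class PerHostCache {
    /** Hypothetical stand-in for a thread-safe, keep-alive HTTP connection. */
    public static final class Conn {
        public final String host;
        public final int slot;
        Conn(String host, int slot) { this.host = host; this.slot = slot; }
    }

    /** The fixed set of connections for one host plus a round-robin counter. */
    private static final class Slots {
        final List<Conn> conns = new ArrayList<>();
        final AtomicInteger next = new AtomicInteger();
    }

    private final int clientsPerHost;
    private final ConcurrentMap<String, Slots> byHost = new ConcurrentHashMap<>();

    public PerHostCache(int clientsPerHost) {
        this.clientsPerHost = clientsPerHost;
    }

    /** Returns one of the cached connections for the host, rotating round-robin. */
    public Conn get(String host) {
        String key = host.toLowerCase(Locale.ROOT);
        // computeIfAbsent creates the connections for a host exactly once,
        // even when several fetcher threads ask for the same host at once.
        Slots s = byHost.computeIfAbsent(key, h -> {
            Slots created = new Slots();
            for (int i = 0; i < clientsPerHost; i++) {
                created.conns.add(new Conn(h, i));
            }
            return created;
        });
        // floorMod keeps the index valid if the counter ever overflows.
        int i = Math.floorMod(s.next.getAndIncrement(), clientsPerHost);
        return s.conns.get(i);
    }

    public static void main(String[] args) {
        PerHostCache cache = new PerHostCache(3); // 3 connections per host, as in the sample
        for (int i = 0; i < 5; i++) {
            Conn c = cache.get("www.apache.org");
            System.out.println(c.host + " -> slot " + c.slot);
        }
    }
}
```

The point of the design is that fetcher threads reuse already open connections instead of paying for a new TCP handshake on every GET.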

Thanks,
Fuad

>This is interesting. Could you please check what is the difference in 
>this benchmark, if you set HttpVersion.HTTP_1_1 in 
>protocol-httpclient/HttpResponse.java:92 ?

>Unfortunately, Nutch cannot use that library because it's LGPL.
-- 
Best regards,
Andrzej Bialecki     <><


Re: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by Andrzej Bialecki <ab...@getopt.org>.
Fuad Efendi (JIRA) wrote:
>      [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
> 
> Fuad Efendi updated NUTCH-109:
> ------------------------------
> 
>     Summary: Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation  (was: Nutch - Fetcher - HTTP - Performance Testing & Tuning)
> 
> I performed performance tests, using default Apache HTTPD Web-Server installation, with crawled 120,000 pages (I used Teleport Ultra to crawl HTML pages from www.apache.org, I spent probably 10 hours)
> 
> Everything run in a separate LAN, Windows XP (Client with Nutch 0.7.1), and Suse Linux 9.3 (Server with Apache)
> 
> I measured crawl for 21,000 pages (Depth=6, Threads=20) (it seems to take few days to crawl all 120,000 pages):
> 
> Protocol-HTTPClient-Innovation: 
> 1,321,470 milliseconds
> 
> Protocol-HTTP: 
> 26,946,076 milliseconds
> 
> Protocol-HttpClient: 
> 27,062,854 milliseconds

This is interesting. Could you please check what is the difference in 
this benchmark, if you set HttpVersion.HTTP_1_1 in 
protocol-httpclient/HttpResponse.java:92 ?

Unfortunately, Nutch cannot use that library because it's LGPL.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by Fuad Efendi <fu...@efendi.ca>.
>Several days for 120,000 pages? That's very slow. Could you show some
>status lines in the log file? (grep "status:") What's the bandwidth you
>have?

AJ,

I mean: I haven't tried to run "-depth 20"; I ran "-depth 6" and crawled
21,000 pages in 7-8 hours... I mirrored 120,000 pages from www.apache.org
using Teleport Ultra, about 10 hours in total for that mirror (8 Mbps
download, 10 threads);

During the 3 tests I crawled (each time) 21,000 pages from a _local_ web site
(in the same LAN segment, 100 Mbps); the existing plugins required 8 hours per
21,000 pages, so I couldn't try 120,000 pages...



Re: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by AJ Chen <ca...@gmail.com>.
Fuad,
Several days for 120,000 pages? That's very slow. Could you show some status
lines in the log file? (grep "status:") What's the bandwidth you have?

-AJ
