Posted to dev@nutch.apache.org by "Fuad Efendi (JIRA)" <ji...@apache.org> on 2005/10/12 00:35:09 UTC
[jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
[ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
Fuad Efendi updated NUTCH-109:
------------------------------
Summary: Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation (was: Nutch - Fetcher - HTTP - Performance Testing & Tuning)
I ran performance tests against a default Apache HTTPD web server installation serving 120,000 pages (I used Teleport Ultra to mirror the HTML pages from www.apache.org, which took roughly 10 hours).
Everything ran in a separate LAN: Windows XP (client with Nutch 0.7.1) and Suse Linux 9.3 (server with Apache).
I measured a crawl of 21,000 pages (Depth=6, Threads=20); crawling all 120,000 pages would apparently take a few days:
Protocol-HTTPClient-Innovation:
1,321,470 milliseconds
Protocol-HTTP:
26,946,076 milliseconds
Protocol-HttpClient:
27,062,854 milliseconds
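To put the three timings on a common scale, here is a small sketch that converts them into pages per second and into a speedup ratio. The class and method names are made up for illustration; the page count (21,000) and the millisecond figures are the ones reported above. With these inputs the new plugin comes out roughly 20 times faster than either existing plugin.

```java
public class CrawlThroughput {

    /** Pages fetched per second for a crawl of `pages` pages that took `elapsedMillis`. */
    public static double pagesPerSecond(long pages, long elapsedMillis) {
        return pages / (elapsedMillis / 1000.0);
    }

    /** How many times faster the candidate run is compared with the baseline run. */
    public static double speedup(long baselineMillis, long candidateMillis) {
        return (double) baselineMillis / candidateMillis;
    }

    public static void main(String[] args) {
        long pages = 21_000L;              // pages crawled in each test run (from the post)
        long innovationMs = 1_321_470L;    // Protocol-HTTPClient-Innovation
        long httpMs = 26_946_076L;         // Protocol-HTTP
        long httpClientMs = 27_062_854L;   // Protocol-HttpClient

        System.out.printf("Innovation: %.2f pages/s%n", pagesPerSecond(pages, innovationMs));
        System.out.printf("HTTP:       %.2f pages/s%n", pagesPerSecond(pages, httpMs));
        System.out.printf("HttpClient: %.2f pages/s%n", pagesPerSecond(pages, httpClientMs));
        System.out.printf("Speedup vs HTTP:       %.2fx%n", speedup(httpMs, innovationMs));
        System.out.printf("Speedup vs HttpClient: %.2fx%n", speedup(httpClientMs, innovationMs));
    }
}
```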
P.S.
Please note that the Protocol-HTTPClient-Innovation plugin is only a basic version, v.0.1.0.
HttpFactory is still growing; it contains a cache (3 TCP connections per host).
http://www.innovation.ch/java/HTTPClient/ is very old but _production_ level... the style of the source code may seem dated... you may need to change "enum" to "enumeration" in the downloaded source files in order to compile it :)))
A very popular load-generating tool is based on HTTPClient (Innovation):
http://grinder.sourceforge.net/
http://www.innovation.ch/java/HTTPClient/
> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> -----------------------------------------------------------------------
>
> Key: NUTCH-109
> URL: http://issues.apache.org/jira/browse/NUTCH-109
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Versions: 0.7, 0.8-dev, 0.6, 0.7.1
> Environment: Nutch: Windows XP, J2SE 1.4.2_09
> Web Server: Suse Linux, Apache HTTPD, apache2-worker, v. 2.0.53
> Reporter: Fuad Efendi
> Attachments: protocol-httpclient-innovation-0.1.0.zip
>
> 1. A TCP connection costs a lot, not only for Nutch and the end-point web servers, but also for intermediary network equipment
> 2. The web server creates a client thread and hopes that Nutch really uses HTTP/1.1, or at least that Nutch sends "Connection: close" before calling "Socket.close()" in the JVM ...
> I need to perform very objective tests, probably 2-3 days; the new plugin crawled/parsed 23,000 pages in 1,321 seconds; it seems that the existing http plugin needs a few days...
> I am using a separate network segment with Windows XP (Nutch) and Suse Linux (Apache HTTPD + 120,000 pages)
> Please find attached the new plugin, based on http://www.innovation.ch/java/HTTPClient/
> Please note:
> Class HttpFactory contains a cache of HTTPConnection objects; each object can be used by any thread; each object is absolutely thread-safe, so we can send multiple GET requests using a single instance:
> private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
RE: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
Posted by Fuad Efendi <fu...@efendi.ca>.
I have to perform another test... At least we know that the problem is in
the network layer...
I believe it is not only HTTP_1_1: establishing a TCP connection also takes
a long time (including intermediary equipment such as routers, firewalls,
per-IP load balancers, ...)
In my sample, HttpFactory caches TCP connections (3 sockets per host), and
HTTPClient automatically reestablishes HTTP keep-alive every 60 seconds;
HttpClient/Apache probably also has this functionality, which we don't use
yet...
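The idea of caching a fixed number of connections per host can be sketched roughly as follows. This is not the code from the attached plugin: the class name PerHostConnectionCache, the generic connection type C, and the connector function are all hypothetical, and the sketch assumes (as stated for HTTPConnection) that a cached connection object is thread-safe and can be shared between fetcher threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: keeps a fixed number of reusable connections per host. */
public class PerHostConnectionCache<C> {

    /** The connections opened for one host, plus a round-robin cursor. */
    private static final class Slot<C> {
        final List<C> connections;
        final AtomicInteger next = new AtomicInteger();
        Slot(List<C> connections) { this.connections = connections; }
    }

    private final int clientsPerHost;              // e.g. 3, as in the post
    private final Function<String, C> connector;   // opens one connection to a host
    private final Map<String, Slot<C>> slots = new ConcurrentHashMap<>();

    public PerHostConnectionCache(int clientsPerHost, Function<String, C> connector) {
        if (clientsPerHost < 1) {
            throw new IllegalArgumentException("clientsPerHost must be >= 1");
        }
        this.clientsPerHost = clientsPerHost;
        this.connector = connector;
    }

    /** Returns a cached connection for the host, opening the host's connections on first use. */
    public C get(String host) {
        Slot<C> slot = slots.computeIfAbsent(host, h -> {
            List<C> list = new ArrayList<>(clientsPerHost);
            for (int i = 0; i < clientsPerHost; i++) {
                list.add(connector.apply(h));
            }
            return new Slot<>(list);
        });
        // Round-robin over the host's connections; floorMod keeps the index
        // non-negative even if the counter overflows.
        int index = Math.floorMod(slot.next.getAndIncrement(), clientsPerHost);
        return slot.connections.get(index);
    }
}
```

With clientsPerHost set to 3, any number of fetcher threads asking for the same host share the same three connection objects, so no new TCP connection is opened per request. Reconnecting after a dropped keep-alive is left to the connection object itself in this sketch.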
Thanks,
Fuad
>This is interesting. Could you please check what is the difference in
>this benchmark, if you set HttpVersion.HTTP_1_1 in
>protocol-httpclient/HttpResponse.java:92 ?
>Unfortunately, Nutch cannot use that library because it's LGPL.
Re: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
Posted by Andrzej Bialecki <ab...@getopt.org>.
Fuad Efendi (JIRA) wrote:
> [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
>
> Fuad Efendi updated NUTCH-109:
> ------------------------------
>
> Summary: Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation (was: Nutch - Fetcher - HTTP - Performance Testing & Tuning)
>
> I performed performance tests, using default Apache HTTPD Web-Server installation, with crawled 120,000 pages (I used Teleport Ultra to crawl HTML pages from www.apache.org, I spent probably 10 hours)
>
> Everything run in a separate LAN, Windows XP (Client with Nutch 0.7.1), and Suse Linux 9.3 (Server with Apache)
>
> I measured crawl for 21,000 pages (Depth=6, Threads=20) (it seems to take few days to crawl all 120,000 pages):
>
> Protocol-HTTPClient-Innovation:
> 1,321,470 milliseconds
>
> Protocol-HTTP:
> 26,946,076 milliseconds
>
> Protocol-HttpClient:
> 27,062,854 milliseconds
This is interesting. Could you please check what is the difference in
this benchmark, if you set HttpVersion.HTTP_1_1 in
protocol-httpclient/HttpResponse.java:92 ?
Unfortunately, Nutch cannot use that library because it's LGPL.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
RE: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
Posted by Fuad Efendi <fu...@efendi.ca>.
>Several days for 120,000 pages? That's very slow. Could you show some
>status lines in the log file? (grep "status:") What's the bandwidth you
>have?
AJ,
I mean: I haven't tried to run "-depth 20"; I ran "-depth 6" and crawled
21,000 pages in 7-8 hours... I mirrored 120,000 pages from www.apache.org
using Teleport Ultra, about 10 hours in total for that crawl (8 Mbps
download, 10 threads);
During the 3 tests I crawled 21,000 pages each time from a _local_ web site
(in the same LAN segment, 100 Mbps); the existing plugins required 8 hours
per 21,000 pages, so I couldn't try 120,000 pages...
Re: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
Posted by AJ Chen <ca...@gmail.com>.
Fuad,
Several days for 120,000 pages? That's very slow. Could you show some status
lines in the log file? (grep "status:") What's the bandwidth you have?
-AJ