Posted to dev@nutch.apache.org by "Fuad Efendi (JIRA)" <ji...@apache.org> on 2005/10/11 02:01:10 UTC

[jira] Created: (NUTCH-109) Nutch - Fetcher - HTTP - Performance Testing & Tuning

Nutch - Fetcher - HTTP - Performance Testing & Tuning
-----------------------------------------------------

         Key: NUTCH-109
         URL: http://issues.apache.org/jira/browse/NUTCH-109
     Project: Nutch
        Type: Improvement
  Components: fetcher  
    Versions: 0.7, 0.6, 0.7.1, 0.8-dev    
 Environment: Nutch: Windows XP, J2SE 1.4.2_09
Web Server: Suse Linux, Apache HTTPD, apache2-worker,  v. 2.0.53
    Reporter: Fuad Efendi


1. Establishing a TCP connection is expensive, not only for Nutch and the end-point web servers but also for intermediary network equipment.

2. The web server creates a client thread and expects that Nutch really uses HTTP/1.1, or at least that Nutch sends "Connection: close" before calling "Socket.close()" in the JVM ...

I need to run objective tests, which will probably take 2-3 days; the new plugin crawled/parsed 23,000 pages in 1,321 seconds, while the existing http-plugin seems to need a few days...

I am using a separate network segment with Windows XP (Nutch) and Suse Linux (Apache HTTPD + 120,000 pages).

Please find attached a new plugin based on http://www.innovation.ch/java/HTTPClient/

Please note: 

Class HttpFactory contains a cache of HTTPConnection objects; each object can be used by every thread and is thread-safe, so we can send multiple GET requests through a single instance:
   private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);
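
The caching idea described above can be sketched roughly as follows. This is a hypothetical illustration, not the attached HttpFactory: the names PerHostPool and Connector are invented here, and the type parameter C merely stands in for a connection class such as HTTPConnection. It keeps a fixed number of connections per host and hands them out round-robin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical per-host connection cache; C stands in for a connection type. */
class PerHostPool<C> {
    /** Opens a new connection to a host; supplied by the caller. */
    interface Connector<T> {
        T connect(String host);
    }

    /** The cached connections of one host plus a round-robin counter. */
    private static final class Slot<T> {
        final List<T> connections = new ArrayList<T>();
        final AtomicInteger next = new AtomicInteger();
    }

    private final int clientsPerHost;
    private final Connector<C> connector;
    private final Map<String, Slot<C>> slots = new ConcurrentHashMap<String, Slot<C>>();

    PerHostPool(int clientsPerHost, Connector<C> connector) {
        if (clientsPerHost < 1) {
            throw new IllegalArgumentException("clientsPerHost must be >= 1");
        }
        this.clientsPerHost = clientsPerHost;
        this.connector = connector;
    }

    /** Returns one of the clientsPerHost cached connections for the host, round-robin. */
    C get(String host) {
        Slot<C> slot = slots.get(host);
        if (slot == null) {
            Slot<C> fresh = new Slot<C>();
            for (int i = 0; i < clientsPerHost; i++) {
                fresh.connections.add(connector.connect(host));
            }
            // If another thread won the race, its slot is used and ours is dropped
            // (a real implementation would also close the dropped connections).
            Slot<C> raced = slots.putIfAbsent(host, fresh);
            slot = (raced != null) ? raced : fresh;
        }
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(slot.next.getAndIncrement(), clientsPerHost);
        return slot.connections.get(index);
    }
}
```

Sharing one returned connection between threads is only safe if the connection class itself is thread-safe, which is exactly the property claimed above for HTTPConnection.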

I'll add more comments after finishing tests...



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


RE: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by Fuad Efendi <fu...@efendi.ca>.
>Several days for 120,000 pages? That's very slow. Could you show some
status lines in the log file? (grep "status:") What's the bandwidth you
have?

AJ,

I mean: I haven't tried to run "-depth 20"; I ran "-depth 6" and crawled
21,000 pages in 7-8 hours... I mirrored 120,000 pages from www.apache.org
using Teleport Ultra, which took about 10 hours in total (8 Mbps download, 10
threads).

During 3 tests I crawled (each time) 21,000 pages from a _local_ web site (in
the same LAN segment, 100 Mbps); the existing plugins required 8 hours per 21,000
pages, so I couldn't try 120,000 pages...



Re: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by AJ Chen <ca...@gmail.com>.
Fuad,
Several days for 120,000 pages? That's very slow. Could you show some status
lines in the log file? (grep "status:") What's the bandwidth you have?

-AJ

On 10/11/05, Fuad Efendi (JIRA) <ji...@apache.org> wrote:
>
> [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
>
> Fuad Efendi updated NUTCH-109:
> ------------------------------
>
> Summary: Nutch - Fetcher - Performance Test - new
> Protocol-HTTPClient-Innovation (was: Nutch - Fetcher - HTTP - Performance
> Testing & Tuning)
>
> I ran performance tests against a default Apache HTTPD web server
> installation serving 120,000 mirrored pages (I used Teleport Ultra to
> mirror the HTML pages from www.apache.org, which took about 10 hours).
>
> Everything ran in a separate LAN: Windows XP (client with Nutch 0.7.1)
> and Suse Linux 9.3 (server with Apache).
>
> I measured a crawl of 21,000 pages (Depth=6, Threads=20) (it seems it would
> take a few days to crawl all 120,000 pages):
>
> Protocol-HTTPClient-Innovation:
> 1,321,470 milliseconds
>
> Protocol-HTTP:
> 26,946,076 milliseconds
>
> Protocol-HttpClient:
> 27,062,854 milliseconds
>
>
> P.S.
> Please note, the Protocol-HTTPClient-Innovation plugin is only a basic
> version, v.0.1.0;
> HttpFactory is still growing and contains a cache (3 TCP connections per host).
> http://www.innovation.ch/java/HTTPClient/ is very old but _production_
> level... the style of the source code may seem dated... you may need to
> rename "enum" to "enumeration" in the downloaded source files in order to
> compile it :)))
>
> A very popular load-generating tool is based on HTTPClient (Innovation):
> http://grinder.sourceforge.net/
> http://www.innovation.ch/java/HTTPClient/
>
>

[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12332089 ] 

Otis Gospodnetic commented on NUTCH-109:
----------------------------------------

If I follow everything correctly, you ran your performance tests, and the conclusion is that all 3 plugins performed roughly the same, so there are no improvements to be made in this particular place at this time.
Shall we close this issue now, then?  

> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> -----------------------------------------------------------------------
>
>          Key: NUTCH-109
>          URL: http://issues.apache.org/jira/browse/NUTCH-109
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7, 0.6, 0.7.1, 0.8-dev
>  Environment: Nutch: Windows XP, J2SE 1.4.2_09
> Web Server: Suse Linux, Apache HTTPD, apache2-worker,  v. 2.0.53
>     Reporter: Fuad Efendi
>  Attachments: protocol-httpclient-innovation-0.1.0.zip, test_results.txt
>
> 1. TCP connection costs a lot, not only for Nutch and end-point web servers, but also for intermediary network equipment 
> 2. Web Server creates Client thread and hopes that Nutch really uses HTTP/1.1, or at least Nutch sends "Connection: close" before closing in JVM "Socket.close()" ...
> I need to perform very objective tests, probably 2-3 days; new plugin crawled/parsed 23,000 pages for 1,321 seconds; it seems that existing http-plugin needs few days...
> I am using separate network segment with Windows XP (Nutch), and Suse Linux (Apache HTTPD + 120,000 pages)
> Please find attached new plugin based on http://www.innovation.ch/java/HTTPClient/
> Please note: 
> Class HttpFactory contains cache of HTTPConnection objects; each object run each thread; each object is absolutely thread-safe, so we can send multiple GET requests using single instance:
>    private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...



Re: [jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by EM <em...@cpuedge.com>.
>It is possible to configure a Linux box (1Mb RAM) with 6000 client threads in the Worker model. It is limited only by the amount of available RAM. I used such a configuration in production: 6 Apache servers sustained 75000 concurrent users performing 1 request per minute, 4kb HTML pages, in load/stress tests by Compuware.
>
>The default installation of Apache Web Server allows 150 client threads.
>
>What does it mean for us? One shared TCP transport connection per web server means one Client Thread instance on Apache. It is impossible to overload Apache using a single TCP connection and performing 100 requests per second; the other 149 threads will successfully handle other clients' requests.
>
>Such proposed behavior of a search engine should not be considered a Denial of Service attack; we are using a single TCP connection for multiple requests.
>
# A denial of service (DoS) attack floods a network with an overwhelming 
amount of traffic, slowing its response time for legitimate traffic or 
grinding it to a halt completely. The more common attacks use built-in 
“features” of the TCP/IP protocol to create exponential amounts of 
network traffic.
www.techdirectcomputers.com/Encyclopedia.htm

# A denial-of-service attack (also, DoS attack) is an attack on a 
computer system or network that causes a loss of service to users, 
typically the loss of network connectivity and services by consuming the 
bandwidth of the victim network or overloading the computational 
resources of the victim system.
en.wikipedia.org/wiki/DDoS

For more, type "define: DDOS" on google.com

[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12332070 ] 

Fuad Efendi commented on NUTCH-109:
-----------------------------------

All 3 plugins perform the same.
However, the first two plugins used a single shared Socket for all 20 threads; the third plugin used 3 shared Sockets for 20 threads.
The third one (the new plugin based on the old Innovation HTTPClient framework) dead-locked when I tried to run 20 threads over a single HTTPClient instance.

It is possible to configure a Linux box (1Mb RAM) with 6000 client threads in the Worker model. It is limited only by the amount of available RAM. I used such a configuration in production: 6 Apache servers sustained 75000 concurrent users performing 1 request per minute, 4kb HTML pages, in load/stress tests by Compuware.

The default installation of Apache Web Server allows 150 client threads.

What does it mean for us? One shared TCP transport connection per web server means one Client Thread instance on Apache. It is impossible to overload Apache using a single TCP connection and performing 100 requests per second; the other 149 threads will successfully handle other clients' requests.

Such proposed behavior of a search engine should not be considered a Denial of Service attack; we are using a single TCP connection for multiple requests.
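
For scale, the production figures quoted above can be turned into request rates. This is only back-of-the-envelope arithmetic on the numbers in the comment (75000 users, 1 request per minute, 6 servers); the class and method names are made up for illustration.

```java
/** Back-of-the-envelope load arithmetic using the figures quoted in the comment. */
class LoadEstimate {
    /** Requests per second produced by `users` clients each sending `perMinute` requests a minute. */
    static double requestsPerSecond(int users, double perMinute) {
        return users * perMinute / 60.0;
    }

    public static void main(String[] args) {
        double total = requestsPerSecond(75000, 1.0); // whole cluster
        double perServer = total / 6;                 // six Apache servers
        System.out.println("total req/s: " + total);
        System.out.println("per-server req/s: " + perServer);
    }
}
```

That works out to 1,250 requests per second across the cluster, or roughly 208 per server, so the 100 requests per second mentioned above is about half of what each of those servers sustained in that load test of small static pages. It says nothing about sites backed by application logic or databases.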



[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331913 ] 

Fuad Efendi commented on NUTCH-109:
-----------------------------------

I was totally wrong and unfair:
====
>Have you seen Kelvin Tan's patch? 
>You should take a look, it's in JIRA, and addresses some of the 
>HTTP/1.1 issues that you are concerned about. 

http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg01037.html 
====

I need to perform tests with Kelvin Tan's patch too.



RE: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by Fuad Efendi <fu...@efendi.ca>.
I have to perform another test... At least we know that the problem is in
the network layer...
I believe it is not only HTTP_1_1: establishing a TCP connection also takes
a long time (including intermediary equipment such as routers, firewalls,
per-IP load balancers, ...)

In my sample, HttpFactory caches TCP connections (3 sockets per host), and
HTTPClient automatically re-establishes HTTP Keep-Alive every 60 seconds;
HttpClient/Apache probably also has this functionality, which we don't use
yet...

Thanks,
Fuad

>This is interesting. Could you please check what is the difference in 
>this benchmark, if you set HttpVersion.HTTP_1_1 in 
>protocol-httpclient/HttpResponse.java:92 ?

>Unfortunately, Nutch cannot use that library because it's LGPL.
-- 
Best regards,
Andrzej Bialecki     <><


Re: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by Andrzej Bialecki <ab...@getopt.org>.
Fuad Efendi (JIRA) wrote:
>      [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
> 
> Fuad Efendi updated NUTCH-109:
> ------------------------------
> 
>     Summary: Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation  (was: Nutch - Fetcher - HTTP - Performance Testing & Tuning)
> 
> I ran performance tests against a default Apache HTTPD web server installation serving 120,000 mirrored pages (I used Teleport Ultra to mirror the HTML pages from www.apache.org, which took about 10 hours).
> 
> Everything ran in a separate LAN: Windows XP (client with Nutch 0.7.1) and Suse Linux 9.3 (server with Apache).
> 
> I measured a crawl of 21,000 pages (Depth=6, Threads=20) (it seems it would take a few days to crawl all 120,000 pages):
> 
> Protocol-HTTPClient-Innovation: 
> 1,321,470 milliseconds
> 
> Protocol-HTTP: 
> 26,946,076 milliseconds
> 
> Protocol-HttpClient: 
> 27,062,854 milliseconds

This is interesting. Could you please check what is the difference in 
this benchmark, if you set HttpVersion.HTTP_1_1 in 
protocol-httpclient/HttpResponse.java:92 ?

Unfortunately, Nutch cannot use that library because it's LGPL.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


[jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]

Fuad Efendi updated NUTCH-109:
------------------------------

    Summary: Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation  (was: Nutch - Fetcher - HTTP - Performance Testing & Tuning)

I ran performance tests against a default Apache HTTPD web server installation serving 120,000 mirrored pages (I used Teleport Ultra to mirror the HTML pages from www.apache.org, which took about 10 hours).

Everything ran in a separate LAN: Windows XP (client with Nutch 0.7.1) and Suse Linux 9.3 (server with Apache).

I measured a crawl of 21,000 pages (Depth=6, Threads=20) (it seems it would take a few days to crawl all 120,000 pages):

Protocol-HTTPClient-Innovation: 
1,321,470 milliseconds

Protocol-HTTP: 
26,946,076 milliseconds

Protocol-HttpClient: 
27,062,854 milliseconds


P.S.
Please note, the Protocol-HTTPClient-Innovation plugin is only a basic version, v.0.1.0;
HttpFactory is still growing and contains a cache (3 TCP connections per host).
http://www.innovation.ch/java/HTTPClient/ is very old but _production_ level... the style of the source code may seem dated... you may need to rename "enum" to "enumeration" in the downloaded source files in order to compile it :)))

A very popular load-generating tool is based on HTTPClient (Innovation):
http://grinder.sourceforge.net/
http://www.innovation.ch/java/HTTPClient/




[jira] Closed: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
     
Andrzej Bialecki  closed NUTCH-109:
-----------------------------------

    Resolution: Invalid

The proposed improvement is not real; the measured difference comes from different configuration settings. The proposed implementation also uses a component with an incompatible license.



[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331950 ] 

Fuad Efendi commented on NUTCH-109:
-----------------------------------

Please see attachment for more details.

To be fair (protocol-http uses a single shared Socket per host), I tried to modify this line in the new plugin's HttpFactory.java:
private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 1);

It was 3 before. However, with http.clients.per.host=1 the new plugin stops in a dead-lock. I tried a few times; it always stops after 3-4 minutes. So the results for the new plugin are with http.clients.per.host=3 (as before); the new plugin did not pass the test, so treat its numbers as a baseline only.



New Test Results:
===============

1. PROTOCOL-HTTP 
=================
910,549,682 bytes (size on disk, WebDB+Segments)
1,201,908 milliseconds

2. PROTOCOL-HTTPCLIENT
========================
935,856,675 bytes
1,261,064 milliseconds

999. PROTOCOL-HTTPCLIENT-INNOVATION
==================================
936,152,532 bytes
1,305,377 milliseconds
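
To put the three timings side by side, the small sketch below converts them into throughput and relative difference. It assumes the crawl covered the same roughly 21,000 pages as the earlier runs (the page count is not restated with these results), and the class and method names are illustrative only.

```java
/** Compares the three crawl timings listed above; 21,000 pages is an assumption. */
class CrawlTiming {
    /** Pages fetched per second for a crawl of `pages` pages that took `millis` ms. */
    static double pagesPerSecond(long pages, long millis) {
        return pages * 1000.0 / millis;
    }

    /** How much longer `millis` is than `baselineMillis`, in percent. */
    static double percentSlower(long millis, long baselineMillis) {
        return (millis - baselineMillis) * 100.0 / baselineMillis;
    }

    public static void main(String[] args) {
        long pages = 21000L;               // assumed, taken from the earlier test description
        long http = 1201908L;              // PROTOCOL-HTTP
        long httpClient = 1261064L;        // PROTOCOL-HTTPCLIENT
        long innovation = 1305377L;        // PROTOCOL-HTTPCLIENT-INNOVATION
        System.out.println("protocol-http pages/s: " + pagesPerSecond(pages, http));
        System.out.println("protocol-httpclient % slower: " + percentSlower(httpClient, http));
        System.out.println("innovation plugin % slower: " + percentSlower(innovation, http));
    }
}
```

Under that assumption all three plugins fetch about 16 to 17.5 pages per second, with protocol-httpclient roughly 5% and the new plugin roughly 9% slower than protocol-http.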




nutch-site.xml
==============
<property>
  <name>fetcher.server.delay</name>
  <value>0</value>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>20</value>
</property>

<property>
  <name>http.timeout</name>
  <value>30000</value>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>



Client:
=======
IBM ThinkPad T42p, 2 GHz, 2 GB, Windows XP, J2SE 1.4.2_09


Server:
=======
Suse Linux 9.3, Apache HTTPD 2.0.53-9.5, Worker


Command:
========
bin\nutch7 crawl url3.txt -dir crawl005 -threads 20 -depth 6

(Modified crawl without indexing)




[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12332079 ] 

Andrzej Bialecki  commented on NUTCH-109:
-----------------------------------------

Fuad, please read again carefully what Otis said: such behaviour by a crawler IS generally considered rude / impolite, even if the target machine survives this. Whether you use a single TCP connection or multiple connections makes almost no difference - you are abusing someone's public service, and prevent other users from using it. You made your tests with a bunch of static pages - fine, but in real life there is some logic and DBs behind them, and by flooding the target servers you monopolize those resources, too, degrading the service for all others.

If you really want to flood your target servers with requests, it's up to you - you can re-configure Nutch to do it - and you should be prepared to suffer from this when the target servers ban your crawler's IP. But the Nutch project should not advocate such irresponsible behaviour. Consequently, we should never use such settings as default.

Aside from the above, as I said before the Innovation code is covered by LGPL, so it cannot be imported to Nutch repository.
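
The per-host politeness argued for above amounts to spacing out requests to the same server (the tests in this thread disabled it by setting fetcher.server.delay to 0). A minimal sketch of such a gate follows; it is a hypothetical illustration and not Nutch's fetcher code, and the class name, method name and the delay value used in the usage note are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-host politeness gate; not Nutch's actual fetcher logic. */
class PolitenessGate {
    private final long delayMillis;
    // For each host, the earliest time at which the next request may start.
    private final Map<String, Long> nextAllowed = new HashMap<String, Long>();

    PolitenessGate(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /**
     * Reserves the next fetch slot for the host and returns how many
     * milliseconds the caller should wait before sending its request.
     */
    synchronized long reserve(String host, long nowMillis) {
        Long slot = nextAllowed.get(host);
        // Start now if the host is idle, otherwise queue behind the last reservation.
        long start = (slot == null || slot < nowMillis) ? nowMillis : slot;
        nextAllowed.put(host, start + delayMillis);
        return start - nowMillis;
    }
}
```

A fetcher thread would call something like `Thread.sleep(gate.reserve(host, System.currentTimeMillis()))` before each request. With a 5000 ms delay, for example, 20 threads hitting the same host are serialized to one request every five seconds, while requests to different hosts do not wait on each other.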



[jira] Commented: (NUTCH-109) Nutch - Fetcher - HTTP - Performance Testing & Tuning

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331764 ] 

Fuad Efendi commented on NUTCH-109:
-----------------------------------

By default, Java 1.4 caches DNS-to-IP mappings forever... 

   java.security.Security.setProperty("networkaddress.cache.ttl", "10000");

- we need to add something like this to the code/configuration.
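
A self-contained version of the call above might look like the sketch below. The networkaddress.cache.ttl value is a number of seconds (so "10000" is a little under three hours, and -1 means cache forever); the class name, helper method and the 3600-second value are illustrative assumptions. The property is normally read when the JVM's address cache is first initialized, so it should be set before the first host lookup.

```java
import java.security.Security;

/** Sketch: bound the JVM's cache lifetime for successful DNS lookups. */
class DnsCacheConfig {
    /** Sets networkaddress.cache.ttl (in seconds) and returns the value now stored. */
    static String setDnsCacheTtl(int seconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
        return Security.getProperty("networkaddress.cache.ttl");
    }

    public static void main(String[] args) {
        // Call this early in startup, before any InetAddress lookup is made.
        System.out.println("networkaddress.cache.ttl = " + setDnsCacheTtl(3600));
    }
}
```

The same property can also be set in the JRE's java.security file instead of in code, which keeps the choice in configuration.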




[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331877 ] 

Fuad Efendi commented on NUTCH-109:
-----------------------------------

OK, I'll do it tonight.
I believe fetcher.server.delay means "wait for a response from the server, then throw a timeout exception".
I can also run 1000 threads; the comparison will then be fair even with fetcher.server.delay=50 seconds (fair because, with that many threads, we will probably see about 20 requests per second: 20 * 50 = 1000).
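
The 20 * 50 = 1000 estimate above can be written as a small helper. This is only a sketch of the arithmetic, assuming every thread is occupied for the full delay per request; ThreadEstimate and threadsNeeded are hypothetical names, not Nutch code.

```java
/** Sketch of the thread-count arithmetic; not Nutch code. */
final class ThreadEstimate {
    /**
     * Threads needed to sustain a request rate when each thread is
     * busy (or sleeping) for delaySeconds per request.
     */
    static long threadsNeeded(double requestsPerSecond, double delaySeconds) {
        if (requestsPerSecond < 0 || delaySeconds < 0) {
            throw new IllegalArgumentException("rate and delay must be non-negative");
        }
        return (long) Math.ceil(requestsPerSecond * delaySeconds);
    }

    public static void main(String[] args) {
        // 20 requests per second with a 50-second delay, as in the comment above.
        System.out.println(threadsNeeded(20, 50));
    }
}
```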




Re: suspicious outlink count

Posted by Piotr Kosiorowski <pk...@gmail.com>.
EM wrote:
> 202443 Pages consumed: 130000 (at index 130000). Links fetched: 233386.
> 202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/].
> 202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315.
> 
> If there is maxoutlinks already specified in the xml config, why does 
> nutch bother counting anything over that again?

During PageRank computation Nutch retrieves all links from a given page
by MD5. If we have many pages with the same MD5, it can retrieve all
outlinks from these pages. I have seen some "bot traps" with big site
structures whose pages had exactly the same MD5 (once I had over a million
identical pages in my index with different URLs from the same host). So
in this case we get the union of all such outlinks. In some
situations a large number of outlinks is not a problem (as in
your case: all pages injected from dmoz are outlinks from dmoz), but
usually it indicates a problem in your index, or at least a reason to
look at it. So I decided to print a warning in this case so that one can
have a look at such a site.
Regards
Piotr
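
The effect described above (outlinks being unioned across pages that share an MD5) can be illustrated with a small sketch. It is not the actual Nutch link analysis code: the Page class, the method names and the threshold parameter are all hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: union outlinks of pages with identical content hashes. Not Nutch code. */
final class OutlinkUnion {
    /** Hypothetical page record: URL, content MD5 and outlink URLs. */
    static final class Page {
        final String url;
        final String md5;
        final List<String> outlinks;

        Page(String url, String md5, List<String> outlinks) {
            this.url = url;
            this.md5 = md5;
            this.outlinks = outlinks;
        }
    }

    /** Groups pages by MD5 and collects the union of their outlinks. */
    static Map<String, Set<String>> unionByMd5(List<Page> pages) {
        Map<String, Set<String>> byMd5 = new LinkedHashMap<>();
        for (Page p : pages) {
            byMd5.computeIfAbsent(p.md5, k -> new LinkedHashSet<>()).addAll(p.outlinks);
        }
        return byMd5;
    }

    /** Returns the MD5s whose unioned outlink count exceeds the threshold. */
    static List<String> suspicious(Map<String, Set<String>> byMd5, int threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : byMd5.entrySet()) {
            if (e.getValue().size() > threshold) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

Because the grouping key is the content hash rather than the URL, a per-page outlink limit in the configuration does not bound the size of the union, which may be why the count is reported separately.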


suspicious outlink count

Posted by EM <em...@cpuedge.com>.
202443 Pages consumed: 130000 (at index 130000). Links fetched: 233386.
202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/].
202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315.

If there is maxoutlinks already specified in the xml config, why does 
nutch bother counting anything over that again?

Re: [jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by EM <em...@cpuedge.com>.
>>We have network equipment limitations too, we can't reach more than
>>65000 threads over single LAN card, and JVM is good (but better is to
>>have multiple JVM/processes, 100 threads each...) 
>>    
>>
>
>65000 threads?  What are you trying to fetch?  The whole web?
>
>
>Otis
>  
>
65000/100 =650 processes.

What kind of server do you have?

Re: [jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by og...@yahoo.com.
Hi,

I find it a bit hard to follow your various ideas here, but I'll add my
comments to some parts below.

--- "Fuad Efendi (JIRA)" <ji...@apache.org> wrote:

>     [
>
http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331892
> ] 
> 
> Fuad Efendi commented on NUTCH-109:
> -----------------------------------
> 
> This method:
>   private static InetAddress blockAddr(URL url) throws
> ProtocolException {...}

Where is this method?

> I checked it in both classes:
>   org.apache.nutch.protocol.http.Http
>   org.apache.nutch.protocol.httpclient.Http
> 
> Default settings (nutch-default.xml):
>   fetcher.server.delay=5.0 (seconds)
>   fetcher.threads.per.host=1
> 
> blockAddr(...) method blocks Internet Address for
> fetcher.server.delay amount of time, it "blocks" this address for all
> threads except current thread. Rest of threads are in Sleep() state;
> amount of sleeping threads is limited by
>   fetcher.threads.per.host

That doesn't sound right.  That property is not meant for specifying
sleep time, but rather the number of threads that are allowed to hit
the same host at the same time.  In other words, this lets you control
the degree of parallelization, so to speak.  That is the equivalent of
those "3 TCP connections" you were mentioning yesterday.

fetcher.server.delay is what specifies "sleep between requests" time.

> So, playing with this parameters we can probably improve performance;
> I'm going to perform new performance tests.
> 
> New plugin does not use this:
>   http.timeout=10000
>   http.content.limit=65536

This may affect your benchmark.  I don't know how much, but it will.

> Keep-Alive timeout is very important; default "Keep-Alive" timeout of
> a new plugin is 60 seconds (it automatically closes HTTP after 60
> seconds).
> 
> 1. we are establishing TCP transport, 100-300 milliseconds X 2-3
> times (TCP HandShake? some IP packets...)
> 2. Apache HTTPD Server creates Client thread to handle our requests,
> 1 second (more or less, try Internet Explorer, first page takes few
> second to download, then browsing works very fast - we have personal
> Thread on the Server).

This is often due to the initial hostname address lookup, when the
domain name server doesn't have the host name's IP address already
cached.

> 3. Line 135, HttpResponse.java:
>      get.releaseConnection();
> 
> Unfortunately we won't use HTTP/1.1 even if I modify some parameters
> such as
>    HttpVersion.HTTP_1_0 (protocol-httpclient/HttpResponse.java:92)
> - we close connection at the end...

Have you seen Kelvin Tan's patch?
You should take a look, it's in JIRA, and addresses some of the
HTTP/1.1 issues that you are concerned about.

> We have network equipment limitations too, we can't reach more than
> 65000 threads over single LAN card, and JVM is good (but better is to
> have multiple JVM/processes, 100 threads each...) 

65000 threads?  What are you trying to fetch?  The whole web?


Otis


> We can load network segment for only 30% due to those HandShakes and
> delays...
> 
> Compare with any free available Web-Grabber tool, even IE/Netscape,
> downloading single big file can use 99% of network capacity,
> downloading multiple HTML - only 20-30% (I saw it in Teleport Pro
> during downloads from multiple linked to Apache sites, 10 threads)
> 
> Apache's MultiThreadedExample.java  uses single instance of
> HttpClient for multiple threads,
>
http://svn.apache.org/viewcvs.cgi/jakarta/commons/proper/httpclient/trunk/src/examples/MultiThreadedExample.java?view=markup
> 
>         // Create an HttpClient with the
> MultiThreadedHttpConnectionManager.
>         // This connection manager must be used if more than one
> thread will
>         // be using the HttpClient.
>         HttpClient httpClient = new HttpClient(new
> MultiThreadedHttpConnectionManager());
>         // Set the default host/protocol for the methods to connect
> to.
>         // This value will only be used if the methods are not given
> an absolute URI
>        
> httpClient.getHostConfiguration().setHost("jakarta.apache.org", 80,
> "http");
> 
> 
> Same was done in a new plugin, with a basic very small code.
> 
> I am going to perform new tests; any suggestions are highly
> welcomed... 
> it will take few days (10 hours per test)
> 
> 


[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331892 ] 

Fuad Efendi commented on NUTCH-109:
-----------------------------------

This method:
  private static InetAddress blockAddr(URL url) throws ProtocolException {...}

I checked it in both classes:
  org.apache.nutch.protocol.http.Http
  org.apache.nutch.protocol.httpclient.Http

Default settings (nutch-default.xml):
  fetcher.server.delay=5.0 (seconds)
  fetcher.threads.per.host=1

The blockAddr(...) method blocks an Internet address for fetcher.server.delay seconds; it "blocks" this address for all threads except the current thread. The remaining threads are in the Sleep() state; the number of sleeping threads is limited by
  fetcher.threads.per.host

So, by playing with these parameters we can probably improve performance; I'm going to run new performance tests.

New plugin does not use this:
  http.timeout=10000
  http.content.limit=65536

The Keep-Alive timeout is very important; the default "Keep-Alive" timeout of the new plugin is 60 seconds (it automatically closes the HTTP connection after 60 seconds).

1. We establish the TCP transport: 100-300 milliseconds x 2-3 times (TCP handshake? some IP packets...)
2. The Apache HTTPD server creates a client thread to handle our requests: about 1 second (try Internet Explorer: the first page takes a few seconds to download, then browsing is very fast, because we have a personal thread on the server).
3. Line 135, HttpResponse.java:
     get.releaseConnection();

Unfortunately we won't use HTTP/1.1 even if I modify some parameters such as
   HttpVersion.HTTP_1_0 (protocol-httpclient/HttpResponse.java:92)
- we close connection at the end...

We have network equipment limitations too: we can't reach more than 65000 threads over a single LAN card. The JVM is fine (but it is better to have multiple JVMs/processes with 100 threads each...)

We can load the network segment to only 30% because of those handshakes and delays...

Compare with any freely available web-grabber tool, or even IE/Netscape: downloading a single big file can use 99% of network capacity, while downloading multiple HTML pages uses only 20-30% (I saw this in Teleport Pro during downloads from multiple sites linked to Apache, 10 threads).

Apache's MultiThreadedExample.java  uses single instance of HttpClient for multiple threads,
http://svn.apache.org/viewcvs.cgi/jakarta/commons/proper/httpclient/trunk/src/examples/MultiThreadedExample.java?view=markup

        // Create an HttpClient with the MultiThreadedHttpConnectionManager.
        // This connection manager must be used if more than one thread will
        // be using the HttpClient.
        HttpClient httpClient = new HttpClient(new MultiThreadedHttpConnectionManager());
        // Set the default host/protocol for the methods to connect to.
        // This value will only be used if the methods are not given an absolute URI
        httpClient.getHostConfiguration().setHost("jakarta.apache.org", 80, "http");


The same was done in the new plugin, with a very small amount of basic code.

I am going to run new tests; any suggestions are very welcome...
It will take a few days (10 hours per test).




[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331904 ] 

Fuad Efendi commented on NUTCH-109:
-----------------------------------

I can't use email right now, so I am putting my comments here:
===
>Have you seen Kelvin Tan's patch?
>You should take a look, it's in JIRA, and addresses some of the
>HTTP/1.1 issues that you are concerned about.

And my reply:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg01037.html

===
>>   private static InetAddress blockAddr(URL url) throws
>> ProtocolException {...}

>Where is this method?

[plugin-httpclient] & [protocol-http], Http.java

===
>> 1. we are establishing TCP transport, 100-300 milliseconds X 2-3
>> times (TCP HandShake? some IP packets...)
>> 2. Apache HTTPD Server creates Client thread to handle our requests,
>> 1 second (more or less, try Internet Explorer, first page takes few
>> second to download, then browsing works very fast - we have personal
>> Thread on the Server).

>This is often due to the initial hostname address lookup, when the
>domain name server doesn't have the host name's IP address already
>cached.

No. A DNS lookup happens only once per JVM lifecycle; the handshakes in 1 and 2 happen many times.

===
>> We have network equipment limitations too, we can't reach more than
>> 65000 threads over single LAN card, and JVM is good (but better is to
>> have multiple JVM/processes, 100 threads each...) 

>65000 threads?  What are you trying to fetch?  The whole web?

It was an example for people trying to use more threads for better performance; they can't use more than 65000. Also, nobody has tested the JVM; Sun's JVM 1.3.1 performed badly with more than 100 threads (at least on Windows).




[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331946 ] 

Fuad Efendi commented on NUTCH-109:
-----------------------------------

>One can also configure these to not delay between requests by setting fetcher.server.delay to zero. 
>Such settings are not considered polite, but they will substantially improve fetcher performance. 

Such settings must be considered polite, since we are using a single transport channel for multiple requests. The server should decide when to reply or delay, depending on its overall load.




[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12332083 ] 

Fuad Efendi commented on NUTCH-109:
-----------------------------------

Sorry for the typo in my previous post: Apache HTTPD server, 1 GB RAM, single CPU, worker model... it uses multiple processes and multiple threads, about 1.2 MB of memory per thread.

Default setting for KeepAliveTimeout on the server: 15 seconds
http://httpd.apache.org/docs/2.0/mod/core.html#keepalivetimeout

We use "Keep-Alive" only when we send subsequent requests within this 15-second interval.
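
The relationship between the fetch delay and the server's keep-alive window can be sketched as a one-line check. This is an illustration only, assuming the idle time is measured from the previous request and that the server closes the connection once the timeout has elapsed; KeepAliveWindow and canReuse are hypothetical names.

```java
/** Sketch: can a persistent connection still be reused? Not Nutch or Apache code. */
final class KeepAliveWindow {
    /** Apache 2.0 default KeepAliveTimeout, in seconds, as quoted above. */
    static final double SERVER_KEEP_ALIVE_SECONDS = 15.0;

    /** True if the next request arrives before the server's keep-alive timeout expires. */
    static boolean canReuse(double secondsSinceLastRequest, double keepAliveTimeoutSeconds) {
        return secondsSinceLastRequest < keepAliveTimeoutSeconds;
    }

    public static void main(String[] args) {
        // Default fetcher.server.delay of 5 seconds fits inside the 15-second window.
        System.out.println(canReuse(5.0, SERVER_KEEP_ALIVE_SECONDS));
        // A 50-second delay does not, so a new TCP connection would be needed.
        System.out.println(canReuse(50.0, SERVER_KEEP_ALIVE_SECONDS));
    }
}
```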

Current Nutch is polite, with its default 5-second interval and randomization in the fetch list. I was wrong: my previous "proposal" improves performance only for limited crawls (a single web server, etc.), and it is pointless for whole-web crawls.

I created this issue because I noticed some performance-related questions on the mailing lists (I also sent such questions in August-September).
Test result: performance is good.

I had one post related to "we are killing web servers": we send an HTTP request, the server creates a client thread, we send another HTTP request over another TCP socket. I was wrong again: we use a shared TCP connection per host, and the server does not create 5 client threads for 5 HTTP requests; it uses a single thread whenever possible.






[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12332091 ] 

Fuad Efendi commented on NUTCH-109:
-----------------------------------

Yes, close this issue please.
Thanks




[jira] Updated: (NUTCH-109) Nutch - Fetcher - HTTP - Performance Testing & Tuning

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]

Fuad Efendi updated NUTCH-109:
------------------------------

    Attachment: protocol-httpclient-innovation-0.1.0.zip

New plugin; you may experiment with commenting out this code in HttpFactory:
	static {
		CookieModule.setCookiePolicyHandler(null);
	}





[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331847 ] 

Doug Cutting commented on NUTCH-109:
------------------------------------

Is your HTTP client polite?  Does it only have a single connection open to the server at a time, and does it pause fetcher.server.delay between each request?  It looks as though you are permitting three simultaneous requests, and I can see no delays.

How did you configure protocol-http and protocol-httpclient?  One can configure these to use multiple connections per server by increasing fetcher.threads.per.host.  By default they will only make a single request at a time.  One can also configure these to not delay between requests by setting fetcher.server.delay to zero.  Such settings are not considered polite, but they will substantially improve fetcher performance.
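
The politeness rule described above (one request at a time per host, spaced by fetcher.server.delay) can be illustrated with a small scheduling sketch. This is not Nutch's fetcher code: the class PoliteSchedule and its reserve method are hypothetical names, and the sketch only spaces request start times, ignoring how long each download takes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not Nutch code): decides when the next request to a host may start. */
class PoliteSchedule {
    // Stands in for fetcher.server.delay, expressed in milliseconds.
    private final long delayMillis;
    // Earliest allowed start time of the next request, per host.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PoliteSchedule(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /**
     * Reserves the next slot for the host and returns the time (in ms)
     * at which the caller may send its request.
     */
    synchronized long reserve(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long start = Math.max(earliest, nowMillis);
        nextAllowed.put(host, start + delayMillis);
        return start;
    }

    public static void main(String[] args) {
        PoliteSchedule schedule = new PoliteSchedule(5000);
        System.out.println(schedule.reserve("example.org", 0));   // prints 0
        System.out.println(schedule.reserve("example.org", 100)); // prints 5000
        System.out.println(schedule.reserve("example.net", 100)); // prints 100
    }
}
```

With a delay of zero every call returns the current time, which corresponds to the "not polite" configuration: nothing holds requests to the same host apart.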





[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331857 ] 

Doug Cutting commented on NUTCH-109:
------------------------------------

Comparing protocol-http and protocol-httpclient with default settings, which permit only a single request at a time with five second delays between each request, to something that permits three simultaneous connections with no delays is not a fair comparison.  There is probably some advantage to using "Keep-Alive", but these benchmarks do not measure it.  To make a fair comparison you must configure Nutch with fetcher.server.delay=0 and fetcher.threads.per.host=3.
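
A rough calculation shows how much of the gap the delay alone explains. The sketch below assumes all pages live on one host (as in the reported test setup) and uses the figures quoted in this issue: 23,000 pages, a five second delay, and 1,321 seconds for the new plugin. The class and method names are hypothetical, and download time is ignored.

```java
/** Hypothetical arithmetic helper: lower bound on crawl time imposed by the per-host delay. */
class DelayFloor {
    /** Minimum seconds for a serialized crawl: one delay between each pair of consecutive requests. */
    static double minSeconds(long pages, double delaySeconds) {
        return (pages - 1) * delaySeconds;
    }

    /** Observed throughput in pages per second. */
    static double pagesPerSecond(long pages, double seconds) {
        return pages / seconds;
    }

    public static void main(String[] args) {
        long pages = 23_000;

        // Default-style settings: one request at a time, 5 s between requests.
        // 22,999 gaps x 5 s = 114,995 s, before any time spent downloading.
        double floor = minSeconds(pages, 5.0);
        System.out.printf("delay floor: %.0f s (about %.1f hours)%n", floor, floor / 3600.0);

        // Reported run of the new plugin: 23,000 pages in 1,321 s, no delays.
        System.out.printf("reported rate: %.1f pages/s%n", pagesPerSecond(pages, 1321.0));

        // With the delay set to zero the floor disappears entirely.
        System.out.printf("floor with zero delay: %.0f s%n", minSeconds(pages, 0.0));
    }
}
```

The delay floor by itself is more than a day for this page count, so the reported difference says little about Keep-Alive until both sides run with the same delay and the same number of connections per host.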




[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331897 ] 

Fuad Efendi commented on NUTCH-109:
-----------------------------------

Oops... I need to learn more!
[protocol-httpclient] Http.java is a singleton that uses MultiThreadedHttpConnectionManager.
It uses a single instance of HttpClient for all hosts and all threads.




[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12332086 ] 

Fuad Efendi commented on NUTCH-109:
-----------------------------------

Andrzej,
I forgot about DBs!
Of course, I agree with Otis.
Innovation's code did not survive 20 threads; I tried it because I suspected performance problems.
Thanks




[jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]

Fuad Efendi updated NUTCH-109:
------------------------------

    Attachment: test_results.txt




[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331848 ] 

Fuad Efendi commented on NUTCH-109:
-----------------------------------

I used the default settings of Nutch-0.7.1; I modified only <name>plugin.includes</name> in nutch-site.xml.

HTTPClient is polite enough; HTTPClient(host) creates a persistent TCP (and HTTP) connection, uses its own threads to manage this connection, and automatically handles all "Keep-Alive"; the default for Keep-Alive is "60 seconds". I have not studied their API thoroughly and I haven't tested it...

HttpFactory has a default setting of 3 HTTPClient instances per host (meaning we have 3 TCP connections per host, and we send multiple GET messages over a single HTTP connection without waiting for a reply)... I used 20 concurrent threads (so I sent several HTTP requests per TCP channel).

A single HTTPConnection object is fully thread-safe and allows multiple threads to send multiple GET requests over a single TCP connection... and all replies are in sync... I can run a separate test for this.
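
The idea of a fixed number of shared clients per host can be sketched as follows. The attached plugin's actual HttpFactory is not shown in this thread, so this is only an illustration under assumptions: PerHostPool, its constructor, and the round-robin hand-out are hypothetical, and the client type is left generic instead of using the real HTTPConnection class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical per-host cache: a fixed number of shared clients per host, handed out round-robin. */
class PerHostPool<C> {
    private final int clientsPerHost;            // plays the role of http.clients.per.host
    private final Function<String, C> factory;   // creates one client for a given host
    private final Map<String, List<C>> byHost = new ConcurrentHashMap<>();
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    PerHostPool(int clientsPerHost, Function<String, C> factory) {
        this.clientsPerHost = clientsPerHost;
        this.factory = factory;
    }

    /** Returns one of the host's clients; callers share clients, so the clients must be thread-safe. */
    C get(String host) {
        // The list is filled once per host and never modified afterwards.
        List<C> clients = byHost.computeIfAbsent(host, h -> {
            List<C> list = new ArrayList<>();
            for (int i = 0; i < clientsPerHost; i++) {
                list.add(factory.apply(h));
            }
            return list;
        });
        int turn = counters.computeIfAbsent(host, h -> new AtomicInteger()).getAndIncrement();
        return clients.get(Math.floorMod(turn, clientsPerHost));
    }

    public static void main(String[] args) {
        PerHostPool<Object> pool = new PerHostPool<>(3, host -> new Object());
        Object first = pool.get("example.org");
        pool.get("example.org");
        pool.get("example.org");
        Object fourth = pool.get("example.org");
        System.out.println(first == fourth); // prints true: the fourth call wraps around
    }
}
```

With 20 fetcher threads and 3 clients per host, each client ends up shared by several threads at once, which is why the thread-safety of the underlying connection class matters so much here.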






[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331907 ] 

Fuad Efendi commented on NUTCH-109:
-----------------------------------

>> ...try Internet Explorer, the first page takes a few 
>> seconds to download, then browsing works very fast - we have a personal 
>> Thread on the Server

>This is often due to the initial hostname address lookup, when the 
>domain name server doesn't have the host name IP address already 
>cached. 

However, I have a local DNS server on the network, and it has a cache... Windows also has its own DNS cache:
%SystemRoot%\system32\svchost.exe -k NetworkService





[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12332027 ] 

Otis Gospodnetic commented on NUTCH-109:
----------------------------------------

It looks like the 3 plugins perform roughly the same.  Am I reading this correctly?

Also I disagree with "Such settings must be considered polite since we are using single transport channel for multiple requests. Server should decide when reply/delay depending on overall load. ".

That's just not how things work in the world of web servers and web crawlers.  Sure, one can flood a server with requests and let the server deal with the sudden overload by replying slowly, but that wouldn't be the right/nice thing to do. :)  Think how many robots are out there, and think how such behaviour would impact human users visiting the overloaded server.


