Posted to dev@nutch.apache.org by Fuad Efendi <fu...@efendi.ca> on 2005/10/01 04:58:21 UTC

RE: what contibute to fetch slowing down

Dear Nutchers,


I noticed the same problem twice: with a Pentium Mobile 2GHz & Windows XP & 2GB,
and with 2x Opteron 252 & SuSE Linux & 4GB.

I have only one explanation, which should probably be mirrored in JIRA:


================
Network.
========


1.
I never had such a problem with The Grinder,
http://grinder.sourceforge.net, which is based on the alternative HTTPClient,
http://www.innovation.ch/java/HTTPClient/index.html. The Apache Software
Foundation should really review their HttpClient RC3(!!!) accordingly;
HTTPClient (the upper-case-HTTP one) is not "alpha", it is a production
version... I have used Grinder a lot; it can run 32 processes with 64
threads each in 2048MB of RAM...


2.
I found this in the Sun API:
java.net.Socket
public void setReuseAddress(boolean on) - please check the API (see the
sketch after item 4)!


3.
I saw this code in your protocol-http plugin:
... HTTP/1.0 ...
Why? Why version 1.0??? The fetcher should understand server replies such
as "Connection: close", "Connection: keep-alive", etc.


4.
By the way, how many file descriptors does UNIX need in order to maintain
65536 network sockets?
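
To make item 2 concrete, here is a minimal, untested sketch (the host is
just a placeholder); note that setReuseAddress() must be called before the
socket is bound or connected:

import java.net.InetSocketAddress;
import java.net.Socket;

public class SocketOptionsSketch {
  public static void main(String[] args) throws Exception {
    Socket socket = new Socket();        // created unconnected on purpose,
                                         // so the options below apply first
    socket.setReuseAddress(true);        // SO_REUSEADDR
    socket.setSoTimeout(10 * 1000);      // fail reads after 10 s instead of hanging
    socket.connect(new InetSocketAddress("example.com", 80), 10 * 1000);
    System.out.println("connected: " + socket);
    socket.close();
  }
}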


Respectfully,
Fuad

P.S.
Sorry guys, I don't have enough time to participate... Could you please
test this suspicious behaviour and this admittedly strange theory? Should I
create a new bug report in JIRA?

SUN's Socket, Apache's HttpClient, UNIX's networking...




-----Original Message-----
From: Daniele Menozzi [mailto:menoz@ngi.it] 
Sent: Wednesday, September 28, 2005 4:42 PM
To: nutch-dev@lucene.apache.org
Subject: Re: what contibute to fetch slowing down


On 10:27:55 28/Sep, AJ Chen wrote:
> I started the crawler with about 2000 sites.  The fetcher could
> achieve 7 pages/sec initially, but the performance gradually dropped
> to about 2 pages/sec, sometimes even 0.5 pages/sec.  The fetch list
> had 300k pages and I used 500 threads. What are the main causes of
> this slowing down?


I have the same problem; I've tried different numbers of fetcher threads
(10, 20, 50, 100, ...), but the download rate always decreases
systematically, page after page. The machine is a P4 1.7GHz with 768 MB of
RAM, running Debian on a 2.6.12 kernel. Bandwidth isn't the problem (10Mbit
in and 10Mbit out), but I cannot obtain a stable, high pages/sec rate. I've
also tried changing machine and kernel, but the problem remains. Can you
please give us some advice? Thank you for your help,
	Menoz



-- 
		      Free Software Enthusiast
		 Debian Powered Linux User #332564 
		     http://menoz.homelinux.org



Re: what contibute to fetch slowing down

Posted by Doug Cutting <cu...@nutch.org>.
Fuad Efendi wrote:
> I found this in the J2SE API for setReuseAddress (default: false):
> =====
> When a TCP connection is closed the connection may remain in a timeout
> state for a period of time after the connection is closed (typically
> known as the TIME_WAIT state or 2MSL wait state). For applications using
> a well known socket address or port it may not be possible to bind a
> socket to the required SocketAddress if there is a connection in the
> timeout state involving the socket address or port. 
> =====

This is related to server sockets, not client sockets.
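
For example, SO_REUSEADDR is what lets a restarted server re-bind its
well-known port while old connections linger in TIME_WAIT. A minimal
sketch (the port number is arbitrary):

import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class RebindSketch {
  public static void main(String[] args) throws Exception {
    ServerSocket server = new ServerSocket();   // created unbound
    server.setReuseAddress(true);               // must be set before bind()
    server.bind(new InetSocketAddress(8080));   // works even with TIME_WAIT leftovers
    server.close();
  }
}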

> It probably means that we are accumulating a huge number (up to ~65000!)
> of "waiting" TCP ports after Socket.close(), and the fetcher threads are
> blocked by the OS until it releases some of these ports... Am I right?

I don't think that is an issue.

Doug

Re: java.net.MalformedURLException: no protocol for parse-plugins.xml

Posted by Jérôme Charron <je...@gmail.com>.
> > Likely missing file:/. If I get rid of lines 617-622
> > of conf/nutch-default.xml
>
> Resolved and committed:
> http://svn.apache.org/viewcvs.cgi?rev=293370&view=rev

Thanks Earl.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: java.net.MalformedURLException: no protocol for parse-plugins.xml

Posted by Jérôme Charron <je...@gmail.com>.
> Likely missing file:/. If I get rid of lines 617-622
> of conf/nutch-default.xml

Oops, sorry.
I made this last change just after testing the whole patch,
and I didn't test it again since I was sure it was a minor change.
I'll correct it right now. Sorry.

Regards

Jérôme


--
http://motrech.free.fr/
http://www.frutch.org/

java.net.MalformedURLException: no protocol for parse-plugins.xml

Posted by Earl Cahill <ca...@yahoo.com>.
I did a clean, full svn update, and ant on trunk, then
tried

bin/nutch crawl urls -dir crawl.test

and got

051002 224950 SEVERE Unable to load parse plugins file
from URL [parse-plugins.xml]
java.net.MalformedURLException: no protocol: ...

Likely missing file:/.  If I get rid of lines 617-622
of conf/nutch-default.xml

<property>
  <name>parse.plugin.file</name>
  <value>parse-plugins.xml</value>
  <description>The name of the file that defines the associations
  between content-types and parsers.</description>
</property>

it at least lets me run my crawl.  Looks like that was added in revision
292865 on Friday by Jérôme.  Putting in the full path works, as per this
patch:

Index: conf/nutch-default.xml
===================================================================
--- conf/nutch-default.xml      (revision 293226)
+++ conf/nutch-default.xml      (working copy)
@@ -616,7 +616,7 @@

 <property>
   <name>parse.plugin.file</name>
-  <value>parse-plugins.xml</value>
+  <value>file:/home/nutch/nutch/trunk/conf/parse-plugins.xml</value>
   <description>The name of the file that defines the associations
   between content-types and parsers.</description>
 </property>

But yeah, that's not a good option.  I tried each
directory and none of them worked.
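
Maybe the loader should instead try the value as a URL and fall back to
the classpath (where conf/ usually lives). An untested sketch, not the
actual Nutch code - the names are mine:

import java.net.MalformedURLException;
import java.net.URL;

public class PluginFileResolver {
  // Try the configured value as a URL first; if it has no protocol,
  // look it up on the classpath instead.
  public static URL resolve(String name) throws MalformedURLException {
    try {
      return new URL(name);              // e.g. "file:/.../parse-plugins.xml"
    } catch (MalformedURLException e) {
      URL onClasspath =
          PluginFileResolver.class.getClassLoader().getResource(name);
      if (onClasspath == null) {
        throw new MalformedURLException("cannot resolve: " + name);
      }
      return onClasspath;                // e.g. bare "parse-plugins.xml"
    }
  }
}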

Hope this gets you at least close.

Earl



RE: what contibute to fetch slowing down

Posted by Fuad Efendi <fu...@efendi.ca>.
Unfortunately, this is commented out in Kelvin's code:

//      reqStr.append("Connection: Keep-Alive\r\n");


I found only     

reqStr.append(" HTTP/1.1\r\n");

- but sending "HTTP/1.1" in the request line does not by itself mean that
the HTTP/1.1 features (persistent connections in particular) are
implemented.


Teleport Ultra v1.29 needs just a few hours to download all the plain HTML
from Sun's site; Nutch needs a few days (on an 8Mbps-down/800Kbps-up link).
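
For comparison, here is roughly what real connection reuse looks like over
a plain socket - my own untested sketch, not Kelvin's code; the host and
paths are placeholders, and chunked responses are not handled:

import java.io.DataInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

public class KeepAliveSketch {
  public static void main(String[] args) throws Exception {
    String host = "example.com";                  // placeholder
    String[] paths = { "/", "/index.html" };      // placeholder

    Socket socket = new Socket(host, 80);         // one TCP connection...
    OutputStream out = socket.getOutputStream();
    DataInputStream in = new DataInputStream(socket.getInputStream());

    for (int p = 0; p < paths.length; p++) {      // ...reused for every request
      out.write(("GET " + paths[p] + " HTTP/1.1\r\n"
          + "Host: " + host + "\r\n"
          + "Connection: keep-alive\r\n\r\n").getBytes("US-ASCII"));
      out.flush();

      int contentLength = -1;                     // parse headers up to blank line
      String line;
      while ((line = readLine(in)).length() > 0) {
        if (line.toLowerCase().startsWith("content-length:")) {
          contentLength = Integer.parseInt(line.substring(15).trim());
        }
      }
      if (contentLength < 0) break;               // chunked/close-delimited: give up

      byte[] body = new byte[contentLength];      // read exactly the body, so the
      in.readFully(body);                         // next response starts cleanly
      System.out.println(paths[p] + ": " + contentLength + " bytes");
    }
    socket.close();
  }

  // Minimal CRLF-terminated line reader for HTTP headers.
  private static String readLine(InputStream in) throws Exception {
    StringBuffer sb = new StringBuffer();
    int c;
    while ((c = in.read()) != -1 && c != '\n') {
      if (c != '\r') sb.append((char) c);
    }
    return sb.toString();
  }
}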


-----Original Message-----
From: Michael Ji [mailto:fji_00@yahoo.com] 
Sent: Sunday, October 02, 2005 5:37 PM
To: nutch-dev@lucene.apache.org
Subject: RE: what contibute to fetch slowing down


Kelvin's OC implementation queues fetch requests according to host and
uses the HTTP 1.1 protocol. It is currently a Nutch patch.

Michael Ji,

--- Fuad Efendi <fu...@efendi.ca> wrote:

=== message truncated ===







RE: what contibute to fetch slowing down

Posted by Fuad Efendi <fu...@efendi.ca>.
Doug,
Thanks for the reply.
I'll try to run specific tests against an in-house Apache during this
week(end) (slightly limited in time... sorry!). Anything is possible;
Apache httpd usually has a timeout setting for keep-alive, and the default
is (I don't remember exactly) probably 600 seconds. I performed such tests
a while ago using Grinder... Also, I can create baselines: Nutch vs.
Grinder, Nutch vs. Teleport. Grinder can save HTTP replies in log files
(it is mostly a load-generation tool); Teleport is a commercial web
grabber...
For HTTP/1.0 we should send an explicit "Connection: close" header before
Socket.close()... But we need to perform real tests anyway.





Re: what contibute to fetch slowing down

Posted by Doug Cutting <cu...@nutch.org>.
Fuad Efendi wrote:
> If I am right, we are simply _killing_ many sites running a default
> Apache HTTPD installation (or Microsoft IIS, etc.), which allows 150
> keep-alive client processes (I once configured 6000 threads for the
> worker model, but that was very unusual). One of those client processes
> is created for each single HTTP request from Nutch, so after 150 pages we
> are simply overloading the web server and receiving "connection timeout"
> exceptions.

I would be surprised if a web server, on exhausting its keep-alive
cache, wouldn't simply close some of those connections.  Repeated
connections without keep-alive should not harm a web server, as long as
they're polite.

> We need to use a real web server during tests, and an HTTP proxy
> (http://grinder.sourceforge.net - a very simple Java-based proxy)

That would be a great contribution.  Do you have time to work on this?

Doug

RE: what contibute to fetch slowing down

Posted by Fuad Efendi <fu...@efendi.ca>.
I have never tried Kelvin's OC; I have only browsed the source code a little.

We need to run tests with JVM 1.4 and with JVM 1.5 (Kelvin's OC).

If I am right, we are simply _killing_ many sites running a default
Apache HTTPD installation (or Microsoft IIS, etc.), which allows 150
keep-alive client processes (I once configured 6000 threads for the
worker model, but that was very unusual). One of those client processes
is created for each single HTTP request from Nutch, so after 150 pages we
are simply overloading the web server and receiving "connection timeout"
exceptions.

We need to use a real web server during tests, and an HTTP proxy
(http://grinder.sourceforge.net - a very simple Java-based proxy)




-----Original Message-----
From: Michael Ji [mailto:fji_00@yahoo.com] 
Sent: Sunday, October 02, 2005 5:37 PM
To: nutch-dev@lucene.apache.org
Subject: RE: what contibute to fetch slowing down


Kelvin's OC implementation queues fetch requests according to host and
uses the HTTP 1.1 protocol. It is currently a Nutch patch.

Michael Ji,

--- Fuad Efendi <fu...@efendi.ca> wrote:

=== message truncated ===







RE: what contibute to fetch slowing down

Posted by Michael Ji <fj...@yahoo.com>.
Kelvin's OC implementation queues fetch requests
according to host and uses the HTTP 1.1 protocol.
It is currently a Nutch patch.

Michael Ji,

--- Fuad Efendi <fu...@efendi.ca> wrote:

=== message truncated ===





RE: what contibute to fetch slowing down

Posted by Fuad Efendi <fu...@efendi.ca>.
Some suggestions to improve performance:


1. Decrease randomization of FetchList.
 
Here is comment from FetchListTool:
   /**
     * The TableSet class will allocate a given FetchListEntry
     * into one of several ArrayFiles.  It chooses which
     * ArrayFile based on a hash of the URL's domain name.
     *
     * It uses a hash of the domain name so that pages are
     * allocated to a random ArrayFile, but same-host pages
     * go to the same file (for efficiency purposes during
     * fetch).
     *
     * Further, within a given file, the FetchListEntry items
     * appear in random order.  This is so that we don't
     * hammer the same site over and over again during fetch.
     *
     * Each table should receive a roughly
     * even number of entries, but all URLs for a  specific 
     * domain name will be found in a single table.  If
     * the dataset is weirdly skewed toward large domains,
     * there may be an uneven distribution.
     */

Same "same-host pages go to the same file" - they should go in a
sequence, without mixing/randomizing with other host-pages...

We fetch a single URL and then forget about the existence of the TCP/IP
connection; we even forget that the web server created a client process to
handle our HTTP requests (this is what Keep-Alive is for). Creating a TCP
connection, and additionally creating such a client process on the web
server, costs a lot of CPU on both sides, Nutch & web server.

I suggest using a single keep-alive thread to fetch each host, without
randomization.
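
A minimal untested sketch of the idea (class names are mine, not Nutch's):
group the fetch list by host and let one thread drain each host's queue in
order, so a single keep-alive connection per host would suffice.

import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class HostQueueSketch {
  public static void main(String[] args) throws Exception {
    String[] fetchList = { "http://example.com/a", "http://example.org/x",
                           "http://example.com/b" };   // placeholders

    // Group URLs by host, preserving order within each host.
    Map queues = new HashMap();                        // host -> List of URLs
    for (int i = 0; i < fetchList.length; i++) {
      String host = new URL(fetchList[i]).getHost();
      List q = (List) queues.get(host);
      if (q == null) queues.put(host, q = new ArrayList());
      q.add(fetchList[i]);
    }

    // One worker per host, draining its queue sequentially over what
    // would be a single kept-alive connection (fetch omitted here).
    for (Iterator it = queues.entrySet().iterator(); it.hasNext();) {
      final Map.Entry entry = (Map.Entry) it.next();
      new Thread() {
        public void run() {
          List q = (List) entry.getValue();
          for (int i = 0; i < q.size(); i++) {
            // fetch q.get(i) here, reusing one connection, with a
            // polite delay between requests to the same host
            System.out.println(entry.getKey() + " <- " + q.get(i));
          }
        }
      }.start();
    }
  }
}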


2. Use/investigate more of the Socket API, such as:
public void setSoTimeout(int timeout)
public void setReuseAddress(boolean on)

I found this in the J2SE API for setReuseAddress (default: false):
=====
When a TCP connection is closed the connection may remain in a timeout
state for a period of time after the connection is closed (typically
known as the TIME_WAIT state or 2MSL wait state). For applications using
a well known socket address or port it may not be possible to bind a
socket to the required SocketAddress if there is a connection in the
timeout state involving the socket address or port. 
=====

It probably means that we are accumulating a huge number (up to ~65000!)
of "waiting" TCP ports after Socket.close(), and the fetcher threads are
blocked by the OS until it releases some of these ports... Am I right?
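
(For scale, a back-of-envelope estimate of my own, assuming TIME_WAIT
lasts 60-240 seconds: the steady-state count is roughly

    ports in TIME_WAIT ~ connections closed per second x TIME_WAIT duration

so 7 pages/sec with one connection per page ties up only about
7 x 240 ~ 1700 ports; exhausting all ~64K ports would take hundreds of
closes per second.)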


P.S.
Anyway, using the Keep-Alive option is very important, not only for us
but also for production web sites.

Thanks,
Fuad




