Posted to user@nutch.apache.org by Jason Camp <jc...@vhosting.com> on 2006/04/25 06:55:38 UTC

Question about crawl expectations

Hi,
  I'm trying to gauge whether one crawl server is performing well, and 
I'm having a tough time determining whether I could increase settings 
to get faster crawls, or whether I'm approaching the max the server can 
handle. The server is a dual AMD Athlon 2200 with 2GB of RAM hanging off 
of a dedicated 10Mb connection. When processing a 1 million URL segment, 
I see these speeds in the log:

281147 pages, 142413 errors, 11.4 pages/s, 1918 kb/s,
281158 pages, 142422 errors, 11.4 pages/s, 1918 kb/s,
281170 pages, 142428 errors, 11.4 pages/s, 1918 kb/s,
281188 pages, 142430 errors, 11.4 pages/s, 1918 kb/s,
281206 pages, 142444 errors, 11.4 pages/s, 1918 kb/s,
281218 pages, 142452 errors, 11.4 pages/s, 1918 kb/s,

It takes about 29 hours to process this segment: it begins on 04/21 at 
10:23pm, starts running mapreduce on 04/22 at 6:25pm, and finishes on 
04/23 at 1:56am.
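
As a rough sanity check, if I read the fetcher's kb/s figure as kilobits 
per second (an assumption on my part), the connection doesn't look 
saturated:

	1918 kbit/s out of a 10,000 kbit/s link  ->  roughly 20% utilization
	11.4 pages/s * 3600                      ->  roughly 41,000 pages/hour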

I understand that the time and speed of fetching is totally dependent on 
the type of content being fetched, but I'm sure there's an average speed 
for a particular type of configuration. If anyone can help me out, or 
needs anything explained better, please let me know. Thank you!

Jason


Re: Question about crawl expectations

Posted by Shawn Gervais <pr...@project10.net>.
Jason Camp wrote:
> Hi,
>  I'm trying to gauge whether one crawl server is performing well, and I'm 
> having a tough time determining whether I could increase settings to 
> get faster crawls, or whether I'm approaching the max the server can 
> handle. The server is a dual AMD Athlon 2200 with 2GB of RAM hanging off 
> of a dedicated 10Mb connection. When processing a 1 million URL segment, 
> I see these speeds in the log:
> 
> 281147 pages, 142413 errors, 11.4 pages/s, 1918 kb/s,

What do you have the "fetcher.threads.fetch" value set to (in 
nutch-site.xml)?

You may also want to ensure you are using good values for 
http.max.delays and http.timeout.

With a similar machine I am able to pull ~30 pages/sec using the 
following settings:

	- http.max.delays 5
	- http.timeout 5000
	- fetcher.threads.fetch 256
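
For reference, in nutch-site.xml those would look something like this 
(property names are the stock ones from nutch-default.xml; values as 
above):

	<property>
	  <name>http.max.delays</name>
	  <value>5</value>
	</property>
	<property>
	  <name>http.timeout</name>
	  <value>5000</value>
	</property>
	<property>
	  <name>fetcher.threads.fetch</name>
	  <value>256</value>
	</property>

Note that http.timeout is in milliseconds, so 5000 is a 5 second timeout.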

HTH,
-Shawn