Posted to dev@nutch.apache.org by Michael <mi...@gameservice.ru> on 2005/10/03 02:36:45 UTC

Re[2]: what contributes to fetch slowing down

3 Mbit, 100 threads = 15 pages/sec.
CPU is low during the fetch, so it's a bandwidth limit.
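A quick back-of-envelope check (editor's sketch, decimal units assumed) supports the bandwidth-limit reading: 3 Mbit/s is about 375 KB/s, which at 15 pages/sec implies roughly 25 KB per page, a plausible average page size for 2005-era pages.

```java
// Editor's sketch: does 15 pages/sec at 3 Mbit/s imply a plausible
// average page size? (Decimal units: 1 Mbit = 1,000,000 bits, 1 KB = 1000 bytes.)
public class BandwidthCheck {
    // Implied average page size in KB if the link is saturated.
    public static double impliedPageSizeKB(double linkMbit, double pagesPerSec) {
        double bytesPerSec = linkMbit * 1_000_000 / 8; // Mbit/s -> bytes/s
        return bytesPerSec / pagesPerSec / 1000;       // bytes/page -> KB/page
    }

    public static void main(String[] args) {
        // 3 Mbit/s at 15 pages/sec -> 25.0 KB per average page
        System.out.println(impliedPageSizeKB(3, 15));
    }
}
```

By the same arithmetic, a 10 Mbit/s link at the same average page size would top out near 50 pages/sec.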

AC> More than 1 million pages were fetched, but it took several days at the
AC> current speed, which is just too slow. I'm planning to get more bandwidth.
AC> Could someone share their experience of what stable rate (pages/sec) can
AC> be achieved with a 3 Mbps or 10 Mbps inbound connection?


Michael


RE: Re[2]: what contributes to fetch slowing down

Posted by og...@yahoo.com.
Fuad,

I think you are constantly comparing apples and oranges here.  It looks
like your new code simply hammers the server, sending multiple requests to a
single server in parallel.  That's a big no-no in the web
crawling/spidering/fetching world, as bad as not obeying robots.txt.

The fact that the speed difference is SO large is a clear hint that the
comparison may not be right and that the 3 plugins you are comparing are
configured very differently.
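Otis's politeness rule (never hit the same server with parallel or back-to-back requests) is usually enforced with a per-host delay; Nutch exposes this kind of setting via configuration (e.g. its fetcher delay properties). A minimal sketch of the idea, with illustrative names that are not Nutch's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Editor's sketch of per-host politeness: allow at most one request to a
// given host every delayMs milliseconds. Names are illustrative only.
public class PoliteScheduler {
    private final long delayMs;
    private final Map<String, Long> nextAllowed = new HashMap<>();

    public PoliteScheduler(long delayMs) {
        this.delayMs = delayMs;
    }

    // Returns how long the caller must wait (ms) before fetching from
    // `host`, and reserves the next slot for that host.
    public synchronized long reserve(String host, long nowMs) {
        long earliest = nextAllowed.getOrDefault(host, nowMs);
        long waitMs = Math.max(0, earliest - nowMs);
        nextAllowed.put(host, Math.max(earliest, nowMs) + delayMs);
        return waitMs;
    }
}
```

A fetcher thread would call `reserve()` before each request and sleep for the returned duration; requests to different hosts are never delayed by each other.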

Otis


--- Fuad Efendi <fu...@efendi.ca> wrote:

> Try new Protocol-HTTPClient-Innovation:
> http://issues.apache.org/jira/browse/NUTCH-109
> 
> 
> -----Original Message-----
> From: Daniele Menozzi [mailto:menoz@ngi.it] 
> Sent: Monday, October 10, 2005 5:42 PM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Re[2]: what contributes to fetch slowing down
> 
> 
> On 03:36:45 03/Oct, Michael wrote:
> > 3 Mbit, 100 threads = 15 pages/sec.
> > CPU is low during the fetch, so it's a bandwidth limit.
> 
> Yes, CPU is low, and even memory is quite free. But with a 10 MB in/out
> connection I cannot obtain good results (and I do not parse the results, I
> simply fetch them). If I use 100 threads, I can download pages at 500 KB/s
> for about 5 seconds, but after that the download rate falls to 0. If I set
> 20 threads, I can download at 200 KB/s for 4-5 minutes, and the rate
> initially seems very stable. But after these few minutes the rate gets
> lower and lower and tends toward zero pages/s.
> 
> I cannot understand what the problem could be. Whatever thread count I
> choose, the rate _always_ decreases until it reaches 1/2 pages/s. I've
> tried 2 different machines, but the problem is always the same.
> 
> Can you please give me some advice?
> Thank you
> 	Daniele
> 
> 
> 
> -- 
> 		      Free Software Enthusiast
> 		 Debian Powered Linux User #332564 
> 		     http://menoz.homelinux.org
> 
> 
> 


RE: Re[2]: what contributes to fetch slowing down

Posted by Fuad Efendi <fu...@efendi.ca>.
Try new Protocol-HTTPClient-Innovation:
http://issues.apache.org/jira/browse/NUTCH-109


-----Original Message-----
From: Daniele Menozzi [mailto:menoz@ngi.it] 
Sent: Monday, October 10, 2005 5:42 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Re[2]: what contributes to fetch slowing down


On 03:36:45 03/Oct, Michael wrote:
> 3 Mbit, 100 threads = 15 pages/sec.
> CPU is low during the fetch, so it's a bandwidth limit.

Yes, CPU is low, and even memory is quite free. But with a 10 MB in/out
connection I cannot obtain good results (and I do not parse the results, I
simply fetch them). If I use 100 threads, I can download pages at 500 KB/s
for about 5 seconds, but after that the download rate falls to 0. If I set
20 threads, I can download at 200 KB/s for 4-5 minutes, and the rate
initially seems very stable. But after these few minutes the rate gets
lower and lower and tends toward zero pages/s.

I cannot understand what the problem could be. Whatever thread count I
choose, the rate _always_ decreases until it reaches 1/2 pages/s. I've
tried 2 different machines, but the problem is always the same.

Can you please give me some advice?
Thank you
	Daniele



-- 
		      Free Software Enthusiast
		 Debian Powered Linux User #332564 
		     http://menoz.homelinux.org



RE: Re[2]: what contributes to fetch slowing down

Posted by Fuad Efendi <fu...@efendi.ca>.
http://nagoya.apache.org/jira
- it does not work right now; I am trying to upload a new Http-Plugin which
seems to be 100 times faster.

1. A TCP connection costs a lot, not only for Nutch and the end-point but
also for the intermediary network equipment.
2. The web server creates a client thread and hopes that Nutch really uses
HTTP/1.1, or at least that Nutch sends "Connection: close" before calling
Socket.close() in the JVM.
...

I need to perform very objective tests, probably 2-3 days; the new plugin
crawled/parsed 23,000 pages in 1,321 seconds; it seems that the existing
http-plugin needs a few days...

I am using a separate network segment with Windows XP (Nutch) and SuSE Linux
(Apache HTTPD + 120,000 pages).
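Fuad's second point is that a client which will not reuse the connection should say so in the request; otherwise an HTTP/1.1 server may park a worker in keep-alive waiting for a follow-up request that never comes. A minimal sketch of what such a single-shot request looks like on the wire (editor's illustration, not the plugin's actual code):

```java
// Editor's sketch: build a single-shot HTTP/1.1 request. The
// "Connection: close" header tells the server not to hold the socket
// open for further requests after it sends the response.
public class RequestBuilder {
    public static String buildSingleShotRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"   // announce we will not reuse the socket
             + "\r\n";                   // blank line ends the header block
    }

    public static void main(String[] args) {
        System.out.print(buildSingleShotRequest("example.com", "/index.html"));
    }
}
```

Without that header (or without genuine keep-alive reuse on the client side), each fetched page pays the full connection setup and teardown cost that point 1 describes.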



-----Original Message-----
From: Daniele Menozzi [mailto:menoz@ngi.it] 
Sent: Monday, October 10, 2005 5:42 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Re[2]: what contributes to fetch slowing down


On 03:36:45 03/Oct, Michael wrote:
> 3 Mbit, 100 threads = 15 pages/sec.
> CPU is low during the fetch, so it's a bandwidth limit.

Yes, CPU is low, and even memory is quite free. But with a 10 MB in/out
connection I cannot obtain good results (and I do not parse the results, I
simply fetch them). If I use 100 threads, I can download pages at 500 KB/s
for about 5 seconds, but after that the download rate falls to 0. If I set
20 threads, I can download at 200 KB/s for 4-5 minutes, and the rate
initially seems very stable. But after these few minutes the rate gets
lower and lower and tends toward zero pages/s.

I cannot understand what the problem could be. Whatever thread count I
choose, the rate _always_ decreases until it reaches 1/2 pages/s. I've
tried 2 different machines, but the problem is always the same.

Can you please give me some advice?
Thank you
	Daniele



-- 
		      Free Software Enthusiast
		 Debian Powered Linux User #332564 
		     http://menoz.homelinux.org



Re: Re[2]: what contributes to fetch slowing down

Posted by Daniele Menozzi <me...@ngi.it>.
On 03:36:45 03/Oct, Michael wrote:
> 3 Mbit, 100 threads = 15 pages/sec.
> CPU is low during the fetch, so it's a bandwidth limit.

Yes, CPU is low, and even memory is quite free. But with a 10 MB in/out
connection I cannot obtain good results (and I do not parse the results, I
simply fetch them). If I use 100 threads, I can download pages at 500 KB/s
for about 5 seconds, but after that the download rate falls to 0. If I set
20 threads, I can download at 200 KB/s for 4-5 minutes, and the rate
initially seems very stable. But after these few minutes the rate gets
lower and lower and tends toward zero pages/s.

I cannot understand what the problem could be. Whatever thread count I
choose, the rate _always_ decreases until it reaches 1/2 pages/s. I've
tried 2 different machines, but the problem is always the same.

Can you please give me some advice?
Thank you
	Daniele



-- 
		      Free Software Enthusiast
		 Debian Powered Linux User #332564 
		     http://menoz.homelinux.org