Posted to dev@nutch.apache.org by Christophe Noel <ch...@cetic.be> on 2005/08/02 12:09:45 UTC

Fetcher delays - benchmarks

Hello,

Following some discussions and developers' mails, I tried to get the
best performance (pages/second) for the following case:

- 120 web servers to crawl
- 10 Mbit/s connection

I reached about 3 Mbit/s average fetching speed with the following
parameters (impolite mode):

- fetcher.server.delay = 1.0
- fetcher.per.host = 20
- threads = 800
- http.timeout = 5000

I see that Nutch is very slow for the first few minutes, and performance
increases with time: it is now at 2500 kb/s and was at 2000 kb/s 5
minutes ago.

 segment 20050802115311, 7200 pages, 446 errors, 231654440 bytes, 706020 ms
050802 120623 148 status: 10.198011 pages/s, 2563.3838 kb/s, 32174.227 bytes/page
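
The figures on the status line can be rederived from the segment summary
above it. A minimal Python sketch, assuming (inferred from the numbers, not
checked against the Nutch source) that "kb/s" means kilobits per second with
1 kb = 1024 bits:

```python
# Values copied from the segment summary line in the post.
pages = 7200
total_bytes = 231_654_440
elapsed_ms = 706_020

elapsed_s = elapsed_ms / 1000.0

pages_per_s = pages / elapsed_s
# Assumption: "kb/s" is kilobits per second, with 1 kb = 1024 bits.
kbit_per_s = total_bytes * 8 / 1024 / elapsed_s
bytes_per_page = total_bytes / pages

print(f"{pages_per_s:.3f} pages/s, {kbit_per_s:.1f} kb/s, {bytes_per_page:.1f} bytes/page")
```

In decimal units this is about 2.6 Mbit/s, roughly a quarter of the
10 Mbit/s link.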

I read Doug Cutting's mail about fetcher.max.delay, but I still don't
understand why I cannot reach 10 Mbit/s with 120 different servers.

Any tips to increase my performance, please?
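
One way to look at the gap is Little's law (average items in flight =
throughput x average time per item). The sketch below is a back-of-the-envelope
estimate, not a description of how the Nutch fetcher schedules its work. It
assumes all 800 threads are occupied for the whole run and a nominal link
speed of 10,000,000 bit/s; the other figures come from the status line.

```python
# Figures from the post.
threads = 800
pages_per_s = 10.198011
bytes_per_page = 32174.227
link_bit_per_s = 10_000_000  # assumption: nominal 10 Mbit/s, decimal units

# Little's law: L = throughput * W, so W = L / throughput.
# Assumption: every thread is busy (fetching or waiting) the whole time.
seconds_per_page_per_thread = threads / pages_per_s

# Page rate needed to fill the link at the observed average page size.
pages_per_s_to_saturate = link_bit_per_s / 8 / bytes_per_page

print(f"each thread spends ~{seconds_per_page_per_thread:.0f} s per page")
print(f"~{pages_per_s_to_saturate:.1f} pages/s would fill the link")
```

Under these assumptions each thread accounts for well over a minute per
page, which would be consistent with threads mostly waiting rather than
transferring data. The post itself does not say where that time goes.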


Thank you very much.

Christophe Noël
Cetic Grid Data Mining

Re: Fetcher delays - benchmarks

Posted by Jay Pound <we...@poundwebhosting.com>.
Yeah, I can saturate 10 Mbit with 150 threads; any more and web pages will
drop. Check your error rate when downloading: if you're getting a lot of
error pages, drop your threads back to a lower number. You're going to have
to find the sweet spot for your connection!
-J
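
The error-rate check suggested above can be sketched as follows. The counts
come from the segment summary in the original post; the 5% threshold and the
halving step are purely illustrative assumptions, not values from this thread
or from Nutch.

```python
def error_rate(pages: int, errors: int) -> float:
    """Fraction of fetched pages that ended in an error."""
    return errors / pages if pages else 0.0


def next_thread_count(threads: int, rate: float, max_rate: float = 0.05) -> int:
    """Hypothetical tuning rule: halve the thread count when errors exceed max_rate."""
    return max(1, threads // 2) if rate > max_rate else threads


rate = error_rate(pages=7200, errors=446)
print(f"error rate: {rate:.1%}")
print("threads for next run:", next_thread_count(800, rate))
```

Repeating the measurement after each change is what locates the "sweet
spot"; the rule above only decides the direction of the next step.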
----- Original Message ----- 
From: "Christophe Noel" <ch...@cetic.be>
To: <nu...@lucene.apache.org>
Sent: Tuesday, August 02, 2005 8:53 AM
Subject: Re: Fetcher delays - benchmarks





Re: Fetcher delays - benchmarks

Posted by Christophe Noel <ch...@cetic.be>.
OK, thank you very much.

Something strange: I tried with 1600 threads (!) instead of 800, and
throughput went from 2.5 Mbit/s (average) to 5 Mbit/s.

Isn't 1600 threads really too big a number?
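
A quick check of the two runs reported in this thread, as a Python sketch
(decimal units assumed, averages taken at face value):

```python
# (threads, average Mbit/s) pairs reported in the thread.
runs = [(800, 2.5), (1600, 5.0)]

# Average share of the link used by each thread, in kbit/s.
per_thread_kbit = [mbit_per_s * 1000 / threads for threads, mbit_per_s in runs]

for (threads, _), kbit in zip(runs, per_thread_kbit):
    print(f"{threads} threads: {kbit:.3f} kbit/s per thread")
```

The per-thread figure is the same in both runs, so throughput scaled
linearly with the thread count here, which suggests the link was not yet
the limiting factor at 800 threads.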




Re: Fetcher delays - benchmarks

Posted by Jay Pound <we...@poundwebhosting.com>.
I'm able to easily saturate my 10 Mbit connection, but it takes a powerful
computer. If your computer is not so powerful, try to fetch with
the -noParsing flag; it will put off the parsing until later.
Even a quad Pentium III Xeon 700 MHz with 4 GB of RAM can only saturate
about 5 Mbit. I've used a 3 GHz Xeon with Hyper-Threading and it can do
10 Mbit (barely) with parsing on. My new dual-core Opteron has about 10%
CPU load with parsing on, and my Athlon 64 3500+ can also do it just fine.
-J
PS: If you have a slow(er) computer, fetch without parsing; you can use a
faster computer to parse the data after the fetch is completed.

BTW: for those who do not know, it takes about 10% upstream bandwidth to
fetch web pages with 100 threads, so if you have a 10 Mbit connection but
only 512 kbit upload, your max download is around 5-6 Mbit.
I found this out with Road Runner's gamer connection (10 Mbit in, 512 kbit out).
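
The rule of thumb in the BTW can be worked through numerically. This Python
sketch takes the roughly 10% upstream-to-downstream ratio at face value; it
is the poster's observation, not a measured constant, and the function name
is made up for illustration.

```python
def max_download_kbit(upload_kbit: float, download_kbit: float,
                      upstream_ratio: float = 0.10) -> float:
    """Download rate sustainable when upstream traffic is upstream_ratio of downstream."""
    return min(download_kbit, upload_kbit / upstream_ratio)


# 10 Mbit down / 512 kbit up, as in the post.
print(max_download_kbit(upload_kbit=512, download_kbit=10_000))
# A symmetric 10 Mbit link is limited by the downstream side instead.
print(max_download_kbit(upload_kbit=10_000, download_kbit=10_000))
```

For the 512 kbit uplink this gives about 5.1 Mbit/s, in line with the
5-6 Mbit quoted above.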