Posted to dev@nutch.apache.org by AJ Chen <ca...@gmail.com> on 2005/10/13 22:35:18 UTC

how to make fetcher to use the full bandwidth

I am trying to fetch as fast as possible by using more threads on a large fetch
list. However, the fetcher starts downloading at a speed much lower than the full
bandwidth allows, and the starting speed varies a lot from run to run, from
200kb/s to 1200kb/s on my DSL line. The same variation occurs on a T1 line
that I just tested.
Could someone share their experience on how to make the fetcher use the full
bandwidth? We know the speed drops gradually during a long fetch run, but
can the fetcher reach the highest speed the bandwidth allows when a fetch
starts?

AJ

Re: how to make fetcher to use the full bandwidth

Posted by Jon Shoberg <jo...@shoberg.net>.

I like to use vnstat to monitor bandwidth. Just keep adding threads as
long as the CPU, memory, and pipe keep holding up.

http://humdi.net/vnstat/
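
The "keep adding threads while things hold up" advice can be sketched as a simple ramp-up loop. This is only an illustration, not part of Nutch: `ramp_threads` and the `measure` callback are hypothetical names, and `measure(n)` is assumed to run a short trial fetch with `n` threads and return the throughput read off a monitor such as vnstat.

```python
def ramp_threads(measure, start=10, max_threads=4000, min_gain=0.05):
    """Double the thread count until throughput stops improving.

    measure(n) is a hypothetical callback: it runs a short fetch with
    n threads and returns the observed throughput in any consistent unit.
    """
    best_n = start
    best_rate = measure(start)
    n = start * 2
    while n <= max_threads:
        rate = measure(n)
        # Stop once another doubling buys less than min_gain (5% by default).
        if rate < best_rate * (1 + min_gain):
            break
        best_n, best_rate = n, rate
        n *= 2
    return best_n, best_rate


# Simulated link that saturates at 1200 units/s (made-up numbers).
print(ramp_threads(lambda n: min(n * 2.0, 1200.0)))  # (640, 1200.0)
```

In practice the measurement is noisy from run to run, so each trial would probably need to be repeated or averaged before deciding to stop.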


-j


Re: how to make fetcher to use the full bandwidth

Posted by AJ Chen <ca...@gmail.com>.
Thanks, Rod. Were you always able to fill the pipe under the same
conditions? I'm puzzled by the difference in fetch speed even when the same
number of threads and root URLs are used.

I don't have a local DNS server yet. To avoid overwhelming my ISP's DNS server, I
use only 10 threads for the first fetch run, so I don't expect great speed
in that run. In the second fetch run, however, I use 500 threads; it
sometimes fills the pipe, but most of the time it uses only 1/5 of the
pipe. The number of hosts, >1500, may be too small. How many hosts do you
usually crawl?

AJ


On 10/13/05, Rod Taylor <rb...@sitesell.com> wrote:
>
> On Thu, 2005-10-13 at 13:35 -0700, AJ Chen wrote:
> > I try to fetch as fast as it can by using more threads on a large fetch
> > list. But, the fetcher starts download at speed much lower than the full
> > bandwidth allows. And the start download speed varies a lot from run to run,
> > 200kb/s to 1200kb/s on my DSL line. This variation also happens on T1 line
> > that I just tested.
> > Could someone share experience on how to make fetcher use the full
> > bandwidth? We know the speed drops gradually during a long fetch run. But,
> > can the fetch achieve the highest speed allowed by the bandwidth when fetch
> > starts?
>
> I found that for high bandwidth (50Mbits and above) DNS seems to be a
> limiting factor.
>
> 4000 threads with a local caching DNS server seems to be enough to fill
> the pipe though
>
> --
> Rod Taylor <rb...@sitesell.com>
>
>

Re: how to make fetcher to use the full bandwidth

Posted by Rod Taylor <rb...@sitesell.com>.
On Thu, 2005-10-13 at 16:42 -0400, Rod Taylor wrote:
> On Thu, 2005-10-13 at 13:35 -0700, AJ Chen wrote:
> > I try to fetch as fast as it can by using more threads on a large fetch
> > list. But, the fetcher starts download at speed much lower than the full
> > bandwidth allows. And the start download speed varies a lot from run to run,
> > 200kb/s to 1200kb/s on my DSL line. This variation also happens on T1 line
> > that I just tested.
> > Could someone share experience on how to make fetcher use the full
> > bandwidth? We know the speed drops gradually during a long fetch run. But,
> > can the fetch achieve the highest speed allowed by the bandwidth when fetch
> > starts?

> 4000 threads with a local caching DNS server seems to be enough to fill
> the pipe though

You're also limited by the number of servers you are connecting to,
since Nutch will by default limit itself to asking for a single page at
a time from a single server.
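
A rough back-of-the-envelope sketch of that limit: if each server hands out one page at a time, the number of distinct hosts caps both the useful thread count and the aggregate throughput. The function names, the page size, and the timing values below are illustrative assumptions, not Nutch defaults; only the 1500-host and 4000-thread figures come from this thread.

```python
def max_fetch_rate(hosts, page_kb, fetch_seconds, delay_seconds):
    """Upper bound on aggregate throughput in kB/s when every host serves
    one page at a time, with a pause between requests to the same host."""
    per_host = page_kb / (fetch_seconds + delay_seconds)  # kB/s from one host
    return hosts * per_host


def useful_threads(hosts, threads):
    """Under a one-request-per-host policy, threads beyond the number of
    hosts have nothing to fetch and simply wait."""
    return min(hosts, threads)


# Assumed numbers: 12 kB pages, 1 s per fetch, 5 s pause between requests.
print(max_fetch_rate(hosts=1500, page_kb=12, fetch_seconds=1, delay_seconds=5))
# 3000.0
print(useful_threads(hosts=1500, threads=4000))
# 1500
```

The bound also shrinks during a run as hosts with few pages are exhausted, which would be consistent with the gradual slowdown described at the start of the thread.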

-- 
Rod Taylor <rb...@sitesell.com>


Re: how to make fetcher to use the full bandwidth

Posted by Rod Taylor <rb...@sitesell.com>.
On Thu, 2005-10-13 at 13:35 -0700, AJ Chen wrote:
> I try to fetch as fast as it can by using more threads on a large fetch
> list. But, the fetcher starts download at speed much lower than the full
> bandwidth allows. And the start download speed varies a lot from run to run,
> 200kb/s to 1200kb/s on my DSL line. This variation also happens on T1 line
> that I just tested.
> Could someone share experience on how to make fetcher use the full
> bandwidth? We know the speed drops gradually during a long fetch run. But,
> can the fetch achieve the highest speed allowed by the bandwidth when fetch
> starts?

I found that at high bandwidth (50 Mbit/s and above), DNS seems to be a
limiting factor.

4000 threads with a local caching DNS server seem to be enough to fill
the pipe, though.
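
To illustrate why caching helps, here is a minimal sketch that resolves each distinct host only once, with a small worker pool, before a high-thread fetch run. It is not a substitute for a local caching DNS server and is not how Nutch resolves names; `resolve_hosts` and its parameters are hypothetical, and the resolver is injectable so the idea can be tried without network access.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_hosts(hosts, resolver=socket.gethostbyname, workers=10):
    """Resolve each distinct host once and return {host: ip or None}."""
    unique = sorted(set(hosts))  # duplicates cost no extra lookups

    def lookup(host):
        try:
            return host, resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            return host, None

    # A small pool keeps the load on the upstream DNS server modest.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lookup, unique))


# Usage with a stand-in resolver (no real DNS traffic):
fake = {"a.example": "192.0.2.1", "b.example": "192.0.2.2"}
print(resolve_hosts(["a.example", "b.example", "a.example"], resolver=fake.__getitem__))
# {'a.example': '192.0.2.1', 'b.example': '192.0.2.2'}
```

The number of lookups equals the number of distinct hosts rather than the number of URLs, which is the saving a caching resolver provides for every fetch thread.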

-- 
Rod Taylor <rb...@sitesell.com>