Posted to dev@nutch.apache.org by Emilijan Mirceski <em...@cpuedge.com> on 2005/07/08 00:52:50 UTC

max fetcher threads per host, buggy behaviour.

I'm currently experiencing a problem with the max threads per host setting.

Nutch is completely ignoring both the maximum threads per host and the delay
that should apply after a thread finishes with a host.

I have the version from 6/24.
The problem occurs regardless of whether I use the default settings (nothing
in nutch-site.xml regarding the fetcher) or specify 20 fetcher threads.

To reproduce:
Fetch something in several segments. 
Merge several segments.
Replace in the configuration of regex-urlfilter.txt:
-[?*!@=]
with 
-[*!@]
because I want to crawl all the forums in my target sites.

Delete the database and recreate it (updatedb).
Start fetching again.
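
To make the effect of the filter change in the steps above concrete, here is a
minimal Python sketch of first-match-wins filtering in the style of
regex-urlfilter.txt. It is not Nutch code: make_filter is a made-up helper,
the example URL is invented, and the trailing "+." accept-everything rule is
an assumption about the rest of the file.

```python
import re

def make_filter(rules):
    """Build an accept(url) function from '+regex' / '-regex' rules.

    Hypothetical helper: the first rule whose regex is found anywhere in the
    URL decides; a URL that matches no rule is rejected.
    """
    compiled = [(rule[0] == "+", re.compile(rule[1:])) for rule in rules]

    def accept(url):
        for allow, pattern in compiled:
            if pattern.search(url):
                return allow
        return False

    return accept

# Assumed rule sets: only the first line of each comes from the message.
default_filter = make_filter(["-[?*!@=]", "+."])
forum_filter = make_filter(["-[*!@]", "+."])

url = "http://example.com/forum/viewtopic.php?t=42"  # invented forum URL
print(default_filter(url))  # False: the URL contains '?' and '='
print(forum_filter(url))    # True: '?' and '=' are no longer excluded
```

Under these assumptions the relaxed rule only changes which URLs are accepted;
it says nothing about how many threads may hit one host at once.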

At this point I can see 20 URLs on the same host being fetched at once, along
with a bunch of errors, because the target sites cannot serve me 20 pages per
10 seconds.

Is this because I removed "?" and "=" from the default filter, or something
else? Any idea how to fetch at most 1 page per host at a time?
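
As an illustration of the behaviour expected here, the following Python sketch
caps the number of concurrent fetches per host. It is not how the Nutch
fetcher is implemented; HostLimiter and its methods are hypothetical names,
and the per-host delay is left out to keep the example short.

```python
import threading
from collections import defaultdict
from urllib.parse import urlsplit

class HostLimiter:
    """Hypothetical per-host cap on concurrent fetches (not Nutch's code)."""

    def __init__(self, max_per_host=1):
        self._max = max_per_host
        self._lock = threading.Lock()
        self._active = defaultdict(int)  # host -> fetches in progress

    def try_acquire(self, url):
        """Return True and count the fetch if the host is below its cap."""
        host = urlsplit(url).hostname
        with self._lock:
            if self._active[host] >= self._max:
                return False
            self._active[host] += 1
            return True

    def release(self, url):
        """Mark one fetch for this URL's host as finished."""
        host = urlsplit(url).hostname
        with self._lock:
            self._active[host] -= 1

limiter = HostLimiter(max_per_host=1)
first = limiter.try_acquire("http://a.example/page1")   # host is free
second = limiter.try_acquire("http://a.example/page2")  # same host, busy
other = limiter.try_acquire("http://b.example/page1")   # different host
```

A fetcher thread that gets False back would have to wait or pick a URL for a
different host, which is the behaviour that appears to be missing in the run
described above.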

I partially worked around the problem by splitting the fetching workload into
20 segments and fetching with 3-5 threads per segment, but this isn't a nice
solution, as I have to micro-manage all the fetch segments and merge them
afterward.

E.

-----Original Message-----
From: Doug Cutting [mailto:cutting@nutch.org] 
Sent: Thursday, July 07, 2005 2:25 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Problems with Fetcher threads?

Are you just crawling a single site?  Just one?  What is 
fetcher.threads.per.host?   It is one by default, but only if 
fetcher.threads.per.host is greater than one will the fetcher be able to 
effectively use multiple threads to crawl a single site.  Otherwise 
these threads will conflict and fail to fetch pages.

Doug

Jakob Heidebrecht wrote:
> Hello,
> 
> Is there a problem of fetching with many threads?
> 
> I injected a single URL into the DB and ran three fetch cycles in each case.
> 
> First case 1 fetcher thread, second and third 20 fetcher threads.
> 
> In the first case I got 102 pages,
> in the second 19 pages and 
> in the third 22 pages.
> 
> Everything else was the same all the time.
> 
> Is this a bug?
> Could the server be kicking me out when I fetch from it with many threads
> at the same time?
> 
> Regards,
> 
> Jakob
>