Posted to user@nutch.apache.org by manubharghav <ma...@gmail.com> on 2012/12/14 07:32:35 UTC

identify domains from fetch lists taking lot of time.

Hi,

I started a crawl of 200 domains to a depth of 5 with a topN of 1
million. A single domain extended my fetch time by a day because it kept
generating outlinks to the same page under different URLs (the parameters
change, but the content remains the same), for example
http://www.awex.com.au/about-awex.html?s=___________. Is there any way
to run content dedup during the fetch itself, or are there other steps
to avoid such cases? The problem is that as the fetch list grows, the
fetcher waits, say, 3 seconds between requests to the same server.
This stalls the node and so lengthens the effective time
of the crawl.


Thanks in advance.
Manu Reddy.



--
View this message in context: http://lucene.472066.n3.nabble.com/identify-domains-from-fetch-lists-taking-lot-of-time-tp4026942.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: identify domains from fetch lists taking lot of time.

Posted by Markus Jelsma <ma...@openindex.io>.

Hi - you have to get rid of those URLs via URL filters. If you cannot filter them out, you can either set the fetcher time limit (see nutch-default) to cap how long the fetcher runs, or set the fetcher minimum throughput (see nutch-default). The latter aborts the fetcher if fewer than N pages/second are fetched. The unfetched records will be fetched later on, together with other queues.
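
To illustrate the URL-filter idea, here is a minimal Python sketch of first-match accept/reject filtering, in the spirit of a regex URL filter rule list. It is not Nutch code: the RULES list, the accept() function and the sample URLs are hypothetical. The pattern is only a guess at what would cover the ?s= variants of the page mentioned in the question, and a real rule would have to be written for the actual filter configuration.

```python
import re

# Hypothetical rule list: '-' rejects, '+' accepts, and the first
# pattern found in the URL decides the outcome.
RULES = [
    # Drop every parameterised variant of the repeating page.
    ("-", re.compile(r"^https?://www\.awex\.com\.au/about-awex\.html\?s=")),
    # Accept everything else.
    ("+", re.compile(r".")),
]


def accept(url):
    """Return True if the URL passes the filter, False if it is dropped."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected


# Sample URLs, made up for this sketch.
urls = [
    "http://www.awex.com.au/about-awex.html",
    "http://www.awex.com.au/about-awex.html?s=abc",
    "http://www.awex.com.au/about-awex.html?s=abcdef",
    "http://example.org/index.html",
]

kept = [u for u in urls if accept(u)]
for u in kept:
    print(u)
```

With a rule like this in place, the parameterised variants never reach the fetch list, so the per-server delay applies to far fewer URLs for that host.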
 