Posted to user@nutch.apache.org by Olena Medelyan <ol...@cs.waikato.ac.nz> on 2006/03/21 05:46:10 UTC

How to terminate the crawl?

Hi,

I'm using the crawl tool in Nutch to crawl the web starting from a set of 
URL seeds. The crawl normally finishes once the specified depth has been 
reached. Is it possible to terminate it instead after a pre-defined number 
of pages, or after text data of a pre-defined size (e.g. 500 MB), has been 
crawled? Thank you for any hints!

Regards,
Olena


Re: How to terminate the crawl?

Posted by Doug Cutting <cu...@apache.org>.
You can limit the number of pages by using the -topN parameter.  This 
limits the number of pages fetched in each round.  Pages are prioritized 
by how well-linked they are.  The maximum number of pages that can be 
fetched is topN*depth.

Doug
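
To make the bound concrete: with -depth 5 and -topN 1000, the one-step 
crawl tool can fetch at most 5000 pages. The invocation below is a sketch; 
the seed and output directory names are only examples, not required names.

```shell
# Example invocation of the Nutch crawl tool (directory names are
# illustrative):
#   bin/nutch crawl urls -dir crawl.test -depth 5 -topN 1000
#
# The upper bound on fetched pages is topN * depth:
depth=5
topN=1000
echo $((depth * topN))   # prints 5000
```

Tune topN and depth together: a shallow crawl with a large topN favours 
breadth from the seeds, while a deep crawl with a small topN follows the 
best-linked pages further out.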


RE: How to terminate the crawl?

Posted by Alexander Hixon <ad...@aquabeta.net>.
You could write a shell script, run via a cron job every minute or so, 
that checks the size of the crawl directory and, if it is over the set 
limit, terminates the Java process. Or, if you can program sufficiently, 
add some Java to the crawler code itself.

As far as I know, there is no setting in the configuration files that 
allows you to do such a thing.

Regards,
Alexander
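
A minimal sketch of that cron approach, assuming the crawl directory 
path, the 500 MB limit, and the pkill pattern (here a hypothetical Nutch 
main-class name) are all adjusted to your setup:

```shell
#!/bin/sh
# Sketch: stop a crawl once its on-disk size exceeds a limit.
# CRAWL_DIR, LIMIT_MB, and the pkill pattern are assumptions, not
# Nutch settings; check your own `ps` output for the right pattern.
CRAWL_DIR="${1:-/tmp/crawl}"
LIMIT_MB=500

# du -sm prints the directory size in whole megabytes; fall back to 0
# if the directory does not exist yet.
size_mb=$(du -sm "$CRAWL_DIR" 2>/dev/null | cut -f1)
size_mb=${size_mb:-0}

if [ "$size_mb" -ge "$LIMIT_MB" ]; then
    # Match the crawler JVM by its command line (class name is a
    # placeholder) and terminate it.
    pkill -f 'org.apache.nutch.tools.CrawlTool' || true
fi
echo "size=${size_mb}MB limit=${LIMIT_MB}MB"
```

A crontab entry to run it every minute might look like 
`* * * * * /path/to/check_crawl_size.sh /path/to/crawl`. Note that 
killing the JVM mid-fetch can leave a partially written segment behind, 
so plan on discarding or re-fetching the last segment.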
