Posted to dev@nutch.apache.org by "Matt Kangas (JIRA)" <ji...@apache.org> on 2006/05/20 00:54:30 UTC

[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

    [ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412601 ] 

Matt Kangas commented on NUTCH-272:
-----------------------------------

I've been thinking about this after hitting several sites that explode into 1.5 M URLs (or more). I could sleep easier at night if I could set a cap at 50k URLs/site and just check my log files in the morning.

Counting total URLs/domain needs to happen in one of the places where Nutch already traverses the crawldb. For Nutch 0.8 this is "nutch generate" and "nutch updatedb". 
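
As a rough illustration of the counting step, here is a minimal standalone sketch in plain Java. It does not use any Nutch or Hadoop classes; the class and method names (HostUrlCounter, countByHost, hostsOverLimit) are hypothetical, and it groups by host name rather than by registered domain. A real version would do the same tallying inside the pass that "generate" or "updatedb" already makes over the crawldb.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Nutch: tallies URLs per host.
public class HostUrlCounter {

    /** Returns the lower-cased host of a URL, or null if it cannot be parsed. */
    public static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Counts how many of the given URLs belong to each host. */
    public static Map<String, Integer> countByHost(Iterable<String> urls) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : urls) {
            String host = hostOf(url);
            if (host != null) {
                counts.merge(host, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the hosts whose URL count is strictly greater than the limit. */
    public static Set<String> hostsOverLimit(Map<String, Integer> counts, int limit) {
        Set<String> over = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > limit) {
                over.add(e.getKey());
            }
        }
        return over;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://a.example/1", "http://a.example/2", "http://a.example/3",
            "http://b.example/1");
        // With a limit of 2, only a.example (3 URLs) is over the cap.
        System.out.println(hostsOverLimit(countByHost(urls), 2));
    }
}
```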

URLs are added by both "nutch inject" and "nutch updatedb". These tools use the URLFilter plugin extension point to determine which URLs to keep and which to reject. But note that "updatedb" can only compute URLs/domain _after_ traversing the crawldb, during which time it merges the new URLs.

So, one way to approach it is:

* Count URLs/domain during "update". If a domain exceeds the limit, write to a file.

* Read this file at the start of "update" (next cycle) and block further additions for the listed domains.

* Or: read the file in a new URLFilter plugin instead, and block the URLs in URLFilter.filter().

If you do it all in "update", you won't catch URLs added via "inject", but it would still halt runaway crawls, and it would be simpler because it would be a one-file patch.
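
The "read the file and block" half of the idea could look roughly like the sketch below. It is a standalone class, not an actual Nutch plugin: the name BlockedHostFilter and the one-host-per-line file format are assumptions made for illustration. Its filter() method follows the convention of returning the URL to keep it and null to reject it, which is an assumption about how URLFilter.filter() behaves rather than something verified against the Nutch source here.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not a real Nutch plugin: rejects URLs whose host
// appears in a file written by an earlier "update" cycle.
public class BlockedHostFilter {
    private final Set<String> blocked = new HashSet<>();

    /** Loads blocked hosts, one per line; a missing file means nothing is blocked. */
    public BlockedHostFilter(Path blockFile) throws IOException {
        if (Files.exists(blockFile)) {
            for (String line : Files.readAllLines(blockFile, StandardCharsets.UTF_8)) {
                String host = line.trim().toLowerCase(Locale.ROOT);
                if (!host.isEmpty()) {
                    blocked.add(host);
                }
            }
        }
    }

    /** Returns the URL unchanged to keep it, or null to reject it. */
    public String filter(String url) {
        try {
            String host = new URI(url).getHost();
            if (host != null && blocked.contains(host.toLowerCase(Locale.ROOT))) {
                return null;
            }
        } catch (URISyntaxException e) {
            // Unparseable URLs are passed through for other filters to judge.
        }
        return url;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("blocked-hosts", ".txt");
        Files.write(f, List.of("a.example"));
        BlockedHostFilter filter = new BlockedHostFilter(f);
        System.out.println(filter.filter("http://a.example/page")); // rejected
        System.out.println(filter.filter("http://b.example/page")); // kept
        Files.delete(f);
    }
}
```

Because the block list is only consulted, never updated, by the filter, the counting step remains the single place that decides which hosts are over the cap.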

> Max. pages to crawl/fetch per site (emergency limit)
> ----------------------------------------------------
>
>          Key: NUTCH-272
>          URL: http://issues.apache.org/jira/browse/NUTCH-272
>      Project: Nutch
>         Type: Improvement

>     Reporter: Stefan Neufeind

>
> If I'm right, there is no way in place right now for setting an "emergency limit" to fetch a certain max. number of pages per site. Is there an "easy" way to implement such a limit, maybe as a plugin?
