Posted to dev@nutch.apache.org by "Stefan Neufeind (JIRA)" <ji...@apache.org> on 2006/05/19 17:08:29 UTC

[jira] Created: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

Max. pages to crawl/fetch per site (emergency limit)
----------------------------------------------------

         Key: NUTCH-272
         URL: http://issues.apache.org/jira/browse/NUTCH-272
     Project: Nutch
        Type: Improvement

    Reporter: Stefan Neufeind


If I'm right, there is no way in place right now for setting an "emergency limit" to fetch a certain max. number of pages per site. Is there an "easy" way to implement such a limit, maybe as a plugin?


[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

Posted by "Matt Kangas (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412613 ] 

Matt Kangas commented on NUTCH-272:
-----------------------------------

To my knowledge, no. I believe the "generate.max.per.host" parameter merely restricts the URLs/host that can be in a given fetchlist. So on an infinite crawler trap, your crawler won't choke on an infinitely large fetchlist, but it will keep gnawing away (infinitely) at the URL space...
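
For reference, the parameter in question is set in nutch-default.xml / nutch-site.xml and looks roughly like this (the value shown is illustrative; if memory serves, -1 disables the limit):

    <property>
      <name>generate.max.per.host</name>
      <value>50</value>
      <description>Maximum number of URLs per host in a single fetchlist.
      Illustrative value; -1 is believed to mean "no limit".</description>
    </property>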


[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

Posted by "Matt Kangas (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412642 ] 

Matt Kangas commented on NUTCH-272:
-----------------------------------

Ok, I just re-read Generator.java ( http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java?view=markup )

 * Selector.map() keeps values where crawlDatum.getFetchTime() <= curTime

 * Selector.reduce() collects until "limit" is reached, optionally skipping the URL if "hostCount.get() > maxPerHost"

So it caps _URLs/host going into this fetchlist_. Not total URLs/host. That's what I thought, and it is insufficient for the reasons stated above. (It will incrementally fetch everything.)
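
To make the distinction concrete, here is a self-contained sketch of the selection logic as I read it, written as plain Java rather than the actual Selector code (all names are made up):

    import java.net.URL;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative only: mimics the per-fetchlist cap described above,
    // NOT the real Generator.Selector code.
    public class PerFetchlistCap {
      public static List<String> select(List<String> candidates, int limit, int maxPerHost)
          throws Exception {
        Map<String, Integer> hostCounts = new HashMap<String, Integer>();
        List<String> fetchlist = new ArrayList<String>();
        for (String url : candidates) {
          if (fetchlist.size() >= limit) break;                // overall fetchlist cap
          String host = new URL(url).getHost();
          Integer seen = hostCounts.get(host);
          if (seen == null) seen = 0;
          if (maxPerHost > 0 && seen >= maxPerHost) continue;  // per-fetchlist host cap
          hostCounts.put(host, seen + 1);
          fetchlist.add(url);
        }
        // Note: this caps only the list built in this round; nothing here limits
        // how many URLs for the host accumulate in the crawldb over many rounds.
        return fetchlist;
      }
    }

Over many generate/fetch/update cycles the same host keeps yielding fresh batches, which is exactly the runaway behaviour described above.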

If the cap is 50k and a host has 70k active URLs in the crawldb, what Generate needs to say is "Here are the first 50k URLs added for this site, and I see only 3 are scheduled. We'll put 3 in this fetchlist."

Generate can only enforce such a limit if it knows which 50k URLs were _first_ added to the db, and it must _never_ fetch any of the remaining 20k.

Hmm... it seems straightforward to modify Generator.java to count total URLs/host during map(), regardless of fetchTime. But I don't see what action we could take besides halting all fetches for the site. We'd have to traverse the crawldb in order of record-creation time to be able to see which were the first N added to it. (I think the crawldb is sorted by URL, not creation time.)


[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412846 ] 

Doug Cutting commented on NUTCH-272:
------------------------------------

In 0.8, URLs are filtered both when generating and when updating the DB.  Strictly speaking, the filters are only required when updating the DB, but they are also applied during generation to allow for changes to the filters.  URLs are also filtered during fetching when following redirects.


[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

Posted by "Matt Kangas (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412842 ] 

Matt Kangas commented on NUTCH-272:
-----------------------------------

Agreed that it's looking tough to do in Generate. Alternatively, we can try to keep the excess URLs from ever entering the crawldb in CrawlDb.update(). (That has its own issues, noted above...)


[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

Posted by "Matt Kangas (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412845 ] 

Matt Kangas commented on NUTCH-272:
-----------------------------------

Scratch my last comment. :-) I assumed that URLFilters.filter() was applied while traversing the segment, as it was in 0.7. Not true in 0.8... it's applied during Generate.

(Wow. This means the crawldb will accumulate lots of junk URLs over time. Is this a feature or a bug?)


[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412621 ] 

Ken Krugler commented on NUTCH-272:
-----------------------------------

The generate.max.per.host parameter does work, but with the following limitations that we've run into:

1. The current code uses the entire hostname when deciding max links/host. There are a lot of spammy sites out there that have URLs with the form xxxx-somedomain.com, where xxxx is essentially a random number.

We've got code that does a better job of deriving the true "base" domain name (a naive sketch of the idea follows at the end of this comment), but then there's...

2. Sites that actually have many IP addresses (not sure if they're in a common subnet block or not), where the domain name is xxxx-somedomain.com.

Because of these two link farm techniques, we ran into cases of 100K links essentially being fetched from the same spam-laden domain, even with a generate.max.per.host setting of 50, after about 40+ loops.

And what's really unfortunate is that many of these sites are low-bandwidth hosters in Korea and China, so your crawl speed drops dramatically because you're spending all your time waiting for worthless bytes to arrive.
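
As an aside on point 1, here is a naive sketch of what "deriving the base domain" might look like. This is illustrative only, not the code we actually use; a real implementation needs a public-suffix list for cases like .co.uk or .com.cn:

    import java.util.Arrays;
    import java.util.List;

    // Naive, illustrative only: group hosts by their last two labels instead
    // of the full hostname.
    public class BaseDomain {
      public static String baseDomain(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) return host;
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
      }

      public static void main(String[] args) {
        List<String> hosts = Arrays.asList("a17.example.com", "b42.example.com", "example.com");
        for (String h : hosts) {
          System.out.println(h + " -> " + baseDomain(h));   // all map to example.com
        }
      }
    }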


[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

Posted by "Matt Kangas (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12413959 ] 

Matt Kangas commented on NUTCH-272:
-----------------------------------

Thanks Doug, that makes more sense now. Running URLFilters.filter() during Generate seems very handy, albeit costly for large crawls. (Should there be an option to turn it off?)

I also see that URLFilters.filter() is applied in Fetcher (for redirects) and ParseOutputFormat, plus other tools.

Another possible choke-point: CrawlDbMerger.Merger.reduce(). The key is the URL, and the keys are sorted. You can veto crawldb additions here. Could you effectively count URLs/host here? (I'm not sure how that behaves when distributed.) Would it require setting a Partitioner, like crawl.PartitionUrlByHost?
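
To sketch the partitioning idea (illustrative only, not the actual crawl.PartitionUrlByHost code): if every URL is routed to a reduce partition keyed on its host, then one reducer sees all of a host's records and can count and cap them there.

    import java.net.MalformedURLException;
    import java.net.URL;

    // Illustrative only: route all URLs for the same host to the same partition,
    // so a single reducer can count (and cap) that host's entries.
    public class HostPartition {
      public static int partitionFor(String url, int numPartitions) throws MalformedURLException {
        String host = new URL(url).getHost();
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }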


[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

Posted by "Matt Kangas (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412601 ] 

Matt Kangas commented on NUTCH-272:
-----------------------------------

I've been thinking about this after hitting several sites that explode into 1.5 M URLs (or more). I could sleep easier at night if I could set a cap at 50k URLs/site and just check my log files in the morning.

Counting total URLs/domain needs to happen in one of the places where Nutch already traverses the crawldb. For Nutch 0.8 those are "nutch generate" and "nutch updatedb".

URLs are added by both "nutch inject" and "nutch updatedb". These tools use the URLFilter plugin extension point to determine which URLs to keep and which to reject. But note that "updatedb" could only compute URLs/domain _after_ traversing the crawldb, during which time it merges in the new URLs.

So, one way to approach it is:

* Count URLs/domain during "update". If a domain exceeds the limit, write that domain to a file.

* Read this file at the start of "update" (next cycle) and block further additions

* Or: read the file in a new URLFilter plugin and block those hosts' URLs in URLFilter.filter() (a rough sketch follows below)

If you do it all in "update", you won't catch URLs added via "inject", but it would still halt runaway crawls, and it would be simpler because it would be a one-file patch.
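
A hedged sketch of the block-list filter idea above. The class name and file format are made up, and the plugin wiring to the real URLFilter extension point is omitted; the core check would just reject any URL whose host was written to the over-limit file during the previous updatedb cycle:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative sketch only; not an actual Nutch plugin.
    public class BlockedHostFilter {
      private final Set<String> blockedHosts = new HashSet<String>();

      public BlockedHostFilter(File blockList) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(blockList));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.length() > 0) blockedHosts.add(line);   // one blocked host per line
          }
        } finally {
          in.close();
        }
      }

      /** Mirrors the URLFilter contract: return the URL to keep it, null to reject it. */
      public String filter(String urlString) {
        try {
          String host = new URL(urlString).getHost();
          return blockedHosts.contains(host) ? null : urlString;
        } catch (MalformedURLException e) {
          return null;   // malformed URL: reject
        }
      }
    }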


[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

Posted by "Matt Kangas (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412614 ] 

Matt Kangas commented on NUTCH-272:
-----------------------------------

BTW, I'd love to be proven wrong, because if the "generate.max.per.host" parameter works as a hard URL cap per site, I could be sleeping better quite soon. :)


[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

Posted by "alan wootton (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412833 ] 

alan wootton commented on NUTCH-272:
------------------------------------

I don't think you can get what you want from any change to either of the map-reduce jobs that Generate is composed of.
What you might need to do is write another map-reduce job that precedes Generate.


[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412605 ] 

Doug Cutting commented on NUTCH-272:
------------------------------------

Does the existing generate.max.per.host parameter not meet this need?


[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412620 ] 

Stefan Neufeind commented on NUTCH-272:
---------------------------------------

Oh, I just discovered this new parameter was added in 0.8-dev :-)

But to my understanding of the description in nutch-default.xml, this only applies "per fetchlist". And that would mean "for one run", right? So if I set this to 100 and fetch 10 rounds, I'd get at most 1,000 documents? But what if there is (theoretically) one document on the first level with 200 links in it? In that case I suspect they are all written to the webdb as "to-do" in the first run; in the next run the first 100 are fetched and the rest skipped, and in yet another round the next 100 are fetched? Is that right?

My idea was also to have this as a "per host" or "per site" setting, or to be able to override the value for a certain host ...
