Posted to user@nutch.apache.org by Kamil Wnuk <ka...@gmail.com> on 2005/09/12 21:46:44 UTC

why are unfetchable sites kept in webdb?

In UpdateDatabaseTool, the function pageGone( ... ) marks pages that have 
remained unreachable after a certain number of retries so that they are 
never fetched again. Is there a compelling reason to keep such pages 
around? It seems like the right thing to do in this case would be to 
remove the page from the webdb with "webdb.deletePage( oldPage )", so 
that the webdb does not accumulate data about pages that no longer 
exist. I would be happy to submit this change if anyone is interested; 
otherwise, please let me know why the current implementation is 
necessary.

Thank you,
Kamil
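
For readers less familiar with the Nutch 0.x webdb code, here is a
minimal sketch of the two options under discussion. All types and method
names are simplified stand-ins; only deletePage( oldPage ) is quoted from
the message above, and even its real signature may differ.

    import java.io.IOException;

    /**
     * Sketch of the two ways the update step could handle a page that
     * has exceeded its fetch retries. Page and WebDB are stand-ins.
     */
    class PageGoneSketch {
        /** Stand-in for a webdb page record. */
        static class Page {
            long nextFetchTime;
        }

        /** Stand-in for the webdb writer. */
        interface WebDB {
            void addPage(Page p) throws IOException;
            void deletePage(Page p) throws IOException;
        }

        static void pageGone(WebDB webdb, Page oldPage) throws IOException {
            // Current behavior: keep the record, but mark it "never
            // fetch" by pushing the next fetch time out indefinitely.
            oldPage.nextFetchTime = Long.MAX_VALUE;
            webdb.addPage(oldPage);

            // Proposed alternative: drop the record entirely.
            // webdb.deletePage(oldPage);
        }
    }

The trade-off between the two is the subject of the reply below.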

Re: why are unfetchable sites kept in webdb?

Posted by Doug Cutting <cu...@nutch.org>.
Kamil Wnuk wrote:
> In UpdateDatabaseTool, the function pageGone( ... ) marks pages that have 
> remained unreachable after a certain number of retries so that they are 
> never fetched again. Is there a compelling reason to keep such pages 
> around? It seems like the right thing to do in this case would be to 
> remove the page from the webdb with "webdb.deletePage( oldPage )", so 
> that the webdb does not accumulate data about pages that no longer 
> exist. I would be happy to submit this change if anyone is interested; 
> otherwise, please let me know why the current implementation is 
> necessary.

The purpose of this is to avoid wasting time trying to fetch these 
pages again if other references to them are encountered.  Imagine a 
large site that you re-crawl frequently and that contains a bad link.  
Should we keep re-learning that the link is bad, or should we remember?

Doug
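
Doug's point, in sketch form: keeping the record as a tombstone lets the
updater skip a dead URL every time it is re-discovered as an outlink.
The types below and the Map-based lookup are illustrative stand-ins, as
is the NEVER_FETCH marker, not the actual Nutch API.

    import java.util.Map;

    /**
     * Sketch of why a tombstone record matters during a re-crawl.
     */
    class DeadLinkSketch {
        /** Stand-in marker for "never fetch this page again". */
        static final long NEVER_FETCH = Long.MAX_VALUE;

        /** Stand-in for a webdb page record. */
        static class Page {
            long nextFetchTime;
        }

        /**
         * Decides whether a newly discovered outlink is worth
         * scheduling, given a URL -> Page view of the current webdb.
         */
        static boolean worthFetching(Map<String, Page> webdb, String url) {
            Page existing = webdb.get(url);
            // Tombstone present: we already learned this URL is gone,
            // so the fetcher never wastes a request on it.
            if (existing != null && existing.nextFetchTime == NEVER_FETCH) {
                return false;
            }
            // If the tombstone had been deleted instead, a dead URL
            // linked from a frequently re-crawled page would look brand
            // new here, get scheduled, and fail again on every cycle.
            return true;
        }
    }

On this view, deleting the record would trade a little webdb space for
repeated failed fetches of any dead URL that stays linked from a live
page.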