Posted to user@nutch.apache.org by Kamil Wnuk <ka...@gmail.com> on 2005/09/12 21:46:44 UTC
why are unfetchable sites kept in webdb?
In UpdateDatabaseTool, the function pageGone( ... ) sets pages that have
remained unreachable for a certain number of retries to never be fetched. Is
there a compelling reason to keep such pages around? It seems like the right
thing to do in this case would be to just remove the page from the webdb
with "webdb.deletePage( oldPage )" in order to keep the webdb from
accumulating data about pages that no longer exist. I would be happy to
submit this change if anyone is interested, otherwise please let me know why
the current implementation is necessary.
Thank you,
Kamil
Re: why are unfetchable sites kept in webdb?
Posted by Doug Cutting <cu...@nutch.org>.
Kamil Wnuk wrote:
> In UpdateDatabaseTool, the function pageGone( ... ) sets pages that have
> remained unreachable for a certain number of retries to never be fetched. Is
> there a compelling reason to keep such pages around? It seems like the right
> thing to do in this case would be to just remove the page from the webdb
> with "webdb.deletePage( oldPage )" in order to keep the webdb from
> accumulating data about pages that no longer exist. I would be happy to
> submit this change if anyone is interested, otherwise please let me know why
> the current implementation is necessary.
The purpose of this is to avoid wasting time trying to re-fetch these
pages if other references to them are encountered. Imagine a large
site with a bad link on it that you re-crawl frequently. Should we keep
re-learning that the link is bad, or should we remember?
Doug
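The trade-off Doug describes can be sketched roughly as follows. This is a hypothetical, self-contained illustration, not the actual Nutch webdb or `UpdateDatabaseTool` API: the class, field, and method names here (`GonePageSketch`, `recordFailure`, `shouldFetch`, `MAX_RETRIES`) are invented for the example. The point is that remembering a page as "gone" caps the number of wasted fetch attempts, whereas deleting it would let every re-crawl rediscover the link and retry from scratch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of why "gone" pages are kept in the webdb:
// remembering failures prevents re-fetching known-dead links on
// every re-crawl. Not the real Nutch API.
public class GonePageSketch {
    static final int MAX_RETRIES = 3;   // assumed retry threshold

    // Pages we have given up on; analogous to marking a page in
    // pageGone(...) so it is never fetched again.
    final Map<String, Boolean> gone = new HashMap<>();
    final Map<String, Integer> retries = new HashMap<>();
    int fetchAttempts = 0;              // counts wasted network fetches

    // Record a failed fetch; after MAX_RETRIES, mark the page gone.
    void recordFailure(String url) {
        int r = retries.merge(url, 1, Integer::sum);
        if (r >= MAX_RETRIES) gone.put(url, true);
    }

    // Called when an outlink to url is rediscovered during a re-crawl.
    boolean shouldFetch(String url) {
        return !gone.getOrDefault(url, false);
    }

    // Simulate encountering the link on a crawl; the link is dead.
    void tryFetch(String url) {
        if (!shouldFetch(url)) return;  // remembered: no wasted fetch
        fetchAttempts++;
        recordFailure(url);
    }
}
```

With this bookkeeping, encountering the same dead link on ten successive re-crawls costs only three actual fetch attempts; if the page were deleted from the database instead, the retry counter would be lost and every re-crawl would start the retries over.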