Posted to dev@nutch.apache.org by amit sehas <cu...@yahoo.com> on 2014/11/04 19:26:23 UTC
Nutch 2.X question
I have a small question about the Nutch 2.X source code; I hope this is the right mailing list for
that. I was unable to locate the following pieces in the code:
a) Where does the linkdb get generated? Which Java file contains the code for that?
b) I see the WebPage class being used to record the pages that were
gathered. It looks like the crawldb is a repository of these pages. If that is
the case, then:
-- It looks like WebPage stores the contents of the page together with the
rest of the information about the page. How do we delete content that is
old and has not changed for a while?
-- It does not appear that Nutch 2.X has any concept of segments. How do we
delete stuff that is older than one month so that we don't blow out the disk space?
thanks