Posted to dev@nutch.apache.org by amit sehas <cu...@yahoo.com> on 2014/11/04 19:26:23 UTC

Nutch 2.X question

I have a small question about the Nutch 2.X source code; I hope this is the right mailing list for
that. I was unable to locate the following pieces in the code:

a) Where does the linkdb get generated? Which Java file contains the code for that?

b) I see the WebPage class being used to record the pages that were
  gathered. It looks like the crawldb is a repository of these pages. If that is
  the case, then:

  -- It looks like WebPage stores the contents of the page together with the
    rest of the information about the page. How do we delete content that is
    old and has not changed for a while?

  -- It does not appear that Nutch 2.X has any concept of segments. How do we
    delete data that is older than 1 month so that we don't run out of disk space?

Thanks