Posted to user@nutch.apache.org by Dietrich <di...@gmail.com> on 2011/06/20 16:54:12 UTC

How to remove domain from Nutch DB

How can one remove documents from a specific domain from an existing Nutch db?
Adding a filter to regex-urlfilter.txt seems to prevent them from
being added to the linkDb, but documents already in there are not
affected at all, and I could not see how else to do this.
It can't possibly be that I have to completely recreate the crawl folder, can it?

Re: How to remove domain from Nutch DB

Posted by Markus Jelsma <ma...@openindex.io>.
Updating the crawldb with all segments should work. Don't forget the -filter
option, so that the URL filters are also applied to entries already in the
crawldb.
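
A rough sketch of what that could look like, assuming a stock Nutch 1.x
layout where the crawl folder is "crawl" and the launcher is bin/nutch (both
paths are assumptions, as is the example.com rule in the comment; adjust them
to your setup). It only prints the command, so nothing is changed until you
run it yourself:

```shell
# Hypothetical paths: adjust NUTCH and CRAWL_DIR to your installation.
NUTCH="bin/nutch"
CRAWL_DIR="crawl"

# Assumes regex-urlfilter.txt already has an exclusion rule for the domain,
# placed before the final catch-all "+." line, for example:
#   -^https?://([a-z0-9-]+\.)*example\.com/

# Build the updatedb call: all segments via -dir, URL filters applied via -filter.
CMD="$NUTCH updatedb $CRAWL_DIR/crawldb -dir $CRAWL_DIR/segments -filter"

# Dry run: print the command; drop the "echo" to actually execute it.
echo $CMD
```

It is probably worth copying the crawldb aside first, since the update
rewrites it in place.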

On Monday 20 June 2011 16:54:12 Dietrich wrote:
> How can one remove documents from a specific domain from an existing Nutch
> db? Adding a filter to regex-urlfilter.txt seems to prevent them from
> being added to the linkDb, but documents already in there are not
> affected at all, and I could not see how else to do this.
> It can't possibly be that I have to completely recreate the crawl folder,
> can it?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350