Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2006/03/08 20:13:35 UTC

CrawlDb Filter tool, was Re: svn commit: r384219 -

Rod,
some days ago I wrote a small tool that filters a crawlDb.
You can find it here now:
http://issues.apache.org/jira/browse/NUTCH-226
Give it a try and let me know if it works for you. In any case,  
back up your crawlDb first!!!
I have only tested it with a small crawlDb, so use it at your own risk. :)
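The idea behind such a tool is simple: read every entry in the crawlDb, run its URL through the configured URLFilter chain, and only write out the entries that pass. Below is a minimal, self-contained sketch of that filtering contract. The names (`UrlFilter`, `applyFilters`) are hypothetical; the real tool in NUTCH-226 runs as a MapReduce job over the crawlDb and uses Nutch's own URLFilters plugin chain, which this sketch only imitates.

```java
import java.util.List;
import java.util.regex.Pattern;

public class CrawlDbFilterSketch {

    // Mirrors the contract of Nutch's URLFilter.filter(String):
    // return the (possibly rewritten) URL if it passes, or null to reject it.
    interface UrlFilter {
        String filter(String url);
    }

    // Apply every configured filter in turn; a single rejection drops the URL,
    // i.e. the entry would not be written to the filtered crawlDb.
    static String applyFilters(String url, List<UrlFilter> filters) {
        for (UrlFilter f : filters) {
            url = f.filter(url);
            if (url == null) {
                return null;
            }
        }
        return url;
    }

    public static void main(String[] args) {
        // Example filter: reject URLs from an unwanted domain
        // ("crapdomain.com" is just a placeholder for illustration).
        Pattern bad = Pattern.compile("^https?://[^/]*crapdomain\\.com/");
        UrlFilter noBadDomain = url -> bad.matcher(url).find() ? null : url;

        List<UrlFilter> filters = List.of(noBadDomain);
        System.out.println(applyFilters("http://crapdomain.com/page", filters)); // null (dropped)
        System.out.println(applyFilters("http://nutch.apache.org/", filters));   // kept
    }
}
```

With filtering done this way, changing the regex/prefix filter configuration and re-running the tool is enough to purge already-stored URLs, which is exactly the cleanup case Rod describes below.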

Stefan

On 08.03.2006 at 19:47, Rod Taylor wrote:

> On Wed, 2006-03-08 at 19:15 +0100, Andrzej Bialecki wrote:
>> Doug Cutting wrote:
>>> ab@apache.org wrote:
>>>> Don't generate URLs that don't pass URLFilters.
>>>
>>> Just to be clear, this is to support folks changing their filters
>>> while they're crawling, right?  We already filter before we
>>
>> Yes, and this seems to be the most common case. This is especially
>> important since there are no tools yet to clean up the DB.
>
> I have this situation now. There are over 100M urls in my DB from crap
> domains that I want to get rid of.
>
> Adding a --refilter option to updatedb seemed like the most obvious
> course of action.
>
> A completely separate command so it could be initiated by hand would
> also work for me.
>
> -- 
> Rod Taylor <rb...@sitesell.com>
>
>

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com