Posted to dev@nutch.apache.org by Rod Taylor <rb...@sitesell.com> on 2006/03/08 19:47:06 UTC

Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

On Wed, 2006-03-08 at 19:15 +0100, Andrzej Bialecki wrote:
> Doug Cutting wrote:
> > ab@apache.org wrote:
> >> Don't generate URLs that don't pass URLFilters.
> >
> > Just to be clear, this is to support folks changing their filters 
> > while they're crawling, right?  We already filter before we 
> 
> Yes, and this seems to be the most common case. This is especially 
> important since there are no tools yet to clean up the DB.

I have this situation now. There are over 100M urls in my DB from crap
domains that I want to get rid of.

Adding a --refilter option to updatedb seemed like the most obvious
course of action.

A completely separate command so it could be initiated by hand would
also work for me.
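
Whatever form the tool takes, the per-URL test is the same one the
Generator now applies at generate time. A minimal sketch of that check,
assuming the Configuration-based org.apache.nutch.net.URLFilters API
from the 0.8 tree (the class and helper names here are illustrative,
not committed code):

import org.apache.nutch.net.URLFilterException;
import org.apache.nutch.net.URLFilters;

public class RefilterCheckSketch {

  /** True if the URL still passes every configured URLFilter. */
  static boolean passesFilters(URLFilters filters, String url) {
    try {
      // URLFilter contract: each filter returns the (possibly
      // rewritten) URL, or null to reject it; URLFilters chains
      // all active filter plugins in order.
      return filters.filter(url) != null;
    } catch (URLFilterException e) {
      return false; // treat a filter error as a rejection
    }
  }
}

A --refilter pass would simply walk the DB and drop every entry for
which this check returns false.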

-- 
Rod Taylor <rb...@sitesell.com>


Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Posted by Matt Kangas <ka...@gmail.com>.
Rod, I just posted my PruneDB.java file to:
http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html

(104 lines, nutch 0.7 only.)

License granted to anyone to hack/copy this as they wish. Should be
easy to adapt to 0.8.

> Usage: PruneDB <db> [-s]
> Where: <db> is the path of the nutch db to prune
>        -s   simulate: parses the db, but doesn't delete any pages
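
For anyone who can't reach the blog, the heart of such a prune pass is
one loop over the WebDB. A rough sketch of the idea follows; this is
not the actual PruneDB.java, and the WebDBReader/WebDBWriter calls are
from memory of the 0.7 org.apache.nutch.db API, so the exact
signatures may differ:

import java.io.File;
import java.util.Enumeration;

import org.apache.nutch.db.Page;
import org.apache.nutch.db.WebDBReader;
import org.apache.nutch.db.WebDBWriter;
import org.apache.nutch.fs.NutchFileSystem;
import org.apache.nutch.net.URLFilter;

// Hypothetical prune loop in the spirit of PruneDB (Nutch 0.7 only).
// "simulate" mirrors the -s flag: count what would go, delete nothing.
public class PruneSketch {

  public static void prune(NutchFileSystem nfs, File dbDir,
                           URLFilter filter, boolean simulate)
      throws Exception {
    WebDBReader reader = new WebDBReader(nfs, dbDir);
    WebDBWriter writer = simulate ? null : new WebDBWriter(nfs, dbDir);
    int pruned = 0;
    for (Enumeration e = reader.pages(); e.hasMoreElements();) {
      Page page = (Page) e.nextElement();
      String url = page.getURL().toString();
      if (filter.filter(url) == null) {      // rejected by current rules
        pruned++;
        if (writer != null) writer.deletePage(url);
      }
    }
    reader.close();
    if (writer != null) writer.close();
    System.out.println(pruned + (simulate
        ? " pages would be pruned" : " pages pruned"));
  }
}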

--Matt

On Mar 8, 2006, at 1:47 PM, Rod Taylor wrote:

> On Wed, 2006-03-08 at 19:15 +0100, Andrzej Bialecki wrote:
>> Doug Cutting wrote:
>>> ab@apache.org wrote:
>>>> Don't generate URLs that don't pass URLFilters.
>>>
>>> Just to be clear, this is to support folks changing their filters
>>> while they're crawling, right?  We already filter before we
>>
>> Yes, and this seems to be the most common case. This is especially
>> important since there are no tools yet to clean up the DB.
>
> I have this situation now. There are over 100M urls in my DB from crap
> domains that I want to get rid of.
>
> Adding a --refilter option to updatedb seemed like the most obvious
> course of action.
>
> A completely separate command so it could be initiated by hand would
> also work for me.
>
> -- 
> Rod Taylor <rb...@sitesell.com>
>

--
Matt Kangas / kangas@gmail.com



CrawlDb Filter tool, was Re: svn commit: r384219 -

Posted by Stefan Groschupf <sg...@media-style.com>.
Rod,
A few days ago I wrote a small tool that filters a crawlDb.
You can find it here:
http://issues.apache.org/jira/browse/NUTCH-226
Give it a try and let me know if it works for you; in any case,
back up your crawlDb first! I tested it only with a small crawlDb,
so use it at your own risk. :)
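
For the archives: the map phase of a crawlDb filter job along these
lines can be very small. A sketch against the 0.8-dev Hadoop mapred
API (class and field names are illustrative; the actual patch on
NUTCH-226 may be structured differently):

import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.net.URLFilters;

// Hypothetical map phase for a crawlDb filter job: copy each
// <url, CrawlDatum> pair into a fresh crawlDb, silently dropping
// entries that the currently configured URLFilters reject.
public class CrawlDbFilterSketch implements Mapper {

  private URLFilters filters;

  public void configure(JobConf job) {
    filters = new URLFilters(job);  // load the active filter plugins
  }

  public void close() {}

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter)
      throws IOException {
    try {
      if (filters.filter(key.toString()) == null) {
        return;                     // rejected: omit from the new db
      }
    } catch (Exception e) {
      return;                       // filter error: drop the entry too
    }
    output.collect(key, value);     // kept: write through unchanged
  }
}

The job's input is the old crawlDb and its output becomes the new one,
so (as above) keep a backup until you have verified the result.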

Stefan

On 08.03.2006, at 19:47, Rod Taylor wrote:

> On Wed, 2006-03-08 at 19:15 +0100, Andrzej Bialecki wrote:
>> Doug Cutting wrote:
>>> ab@apache.org wrote:
>>>> Don't generate URLs that don't pass URLFilters.
>>>
>>> Just to be clear, this is to support folks changing their filters
>>> while they're crawling, right?  We already filter before we
>>
>> Yes, and this seems to be the most common case. This is especially
>> important since there are no tools yet to clean up the DB.
>
> I have this situation now. There are over 100M urls in my DB from crap
> domains that I want to get rid of.
>
> Adding a --refilter option to updatedb seemed like the most obvious
> course of action.
>
> A completely separate command so it could be initiated by hand would
> also work for me.
>
> -- 
> Rod Taylor <rb...@sitesell.com>
>
>

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com