Posted to user@nutch.apache.org by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2006/03/17 19:44:23 UTC

removing site from webdb

We've got a site that is causing our crawl to slow dramatically, from
20 Mbit/s down to about 3 or 4.  The basic problem is that the site
seems to consist of huge numbers of pages that aren't responding.  We
can remove the site from the index, but we haven't found a way to
remove this site permanently from the webdb so that we never fetch it
again.  Is there an easy way in 0.7.1 to remove a site from the webdb,
and then keep it permanently removed?



Re: removing site from webdb

Posted by Rod Taylor <rb...@sitesell.com>.
On Fri, 2006-03-17 at 13:44 -0500, Insurance Squared Inc. wrote:
> We've got a site that is causing our crawl to slow dramatically, from
> 20 Mbit/s down to about 3 or 4.  The basic problem is that the site
> seems to consist of huge numbers of pages that aren't responding.  We
> can remove the site from the index, but we haven't found a way to
> remove this site permanently from the webdb so that we never fetch it
> again.  Is there an easy way in 0.7.1 to remove a site from the webdb,
> and then keep it permanently removed?

You can add a filter for that domain to your regex-urlfilter.txt file,
or you can let Nutch churn through each URL and mark it as invalid
individually.
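
For illustration, a rule to reject the whole site (using a made-up
domain, slowsite.example.com) would look like this in
regex-urlfilter.txt. Rules are applied in order: a leading '-' rejects
any URL matching the pattern and '+' accepts it, first match wins:

  # reject everything from the problem site, subdomains included
  -^http://([a-z0-9-]+\.)*slowsite\.example\.com/

  # ... existing rules follow, ending in the usual catch-all accept
  +.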

The churn-through-every-URL approach can be done quite quickly if
Nutch scales the number of threads to make the best use of bandwidth.

Encourage the Nutch folks to apply this patch. I give it 50 Mbit/s and
Nutch will scale up to 500 threads per task if most threads are hitting
bad pages, or down to about 60 threads per task if they're downloading
large pages. In the end we stay within about 10% of the 50 Mbit/s target.

http://issues.apache.org/jira/browse/NUTCH-207
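
To sketch the idea behind it (this is an illustration of the approach
only, not the actual patch code; the class name and the numbers below
are made up):

  // Periodically compare measured throughput against a target and
  // adjust the fetcher thread count to stay near the target.
  // Illustrative only; see the JIRA issue above for the real patch.
  public class BandwidthThreadScaler {
      private final long targetBitsPerSec;  // e.g. 50000000L for 50 Mbit/s
      private final int minThreads;         // e.g. 60
      private final int maxThreads;         // e.g. 500
      private int threads;

      public BandwidthThreadScaler(long targetBitsPerSec, int minThreads,
                                   int maxThreads, int initialThreads) {
          this.targetBitsPerSec = targetBitsPerSec;
          this.minThreads = minThreads;
          this.maxThreads = maxThreads;
          this.threads = initialThreads;
      }

      /** Call once per sampling interval with the bits fetched in it. */
      public int adjust(long bitsFetched, long intervalMillis) {
          long actual = bitsFetched * 1000L / intervalMillis;
          if (actual < targetBitsPerSec * 9L / 10L) {
              // Under target (threads stuck on dead pages): add threads.
              threads = Math.min(maxThreads, threads + threads / 10 + 1);
          } else if (actual > targetBitsPerSec * 11L / 10L) {
              // Over target (threads pulling large pages): shed threads.
              threads = Math.max(minThreads, threads - threads / 10 - 1);
          }
          return threads;
      }
  }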


-- 
Rod Taylor <rb...@sitesell.com>


Re: removing site from webdb

Posted by Matt Kangas <ka...@gmail.com>.
An easy way to do this for Nutch 0.7.1:
- Adjust regex-urlfilter.txt (as Rod mentioned), or some other
  component of your URLFilter chain, to screen out the site
- Run my PruneDBTool to force all URLs in the webdb through the
  URLFilter chain again

Code is here:
http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html
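
The core logic of such a prune pass is simple: re-run every URL in the
webdb through the filter chain and delete the ones that no longer
pass. Here's a minimal sketch of just that logic. The webdb I/O itself
is omitted, and the UrlFilter interface below only mirrors the
contract of Nutch's URLFilter plugins (return the URL to keep it, or
null to reject it); see the linked code for the working 0.7.1 tool:

  import java.util.ArrayList;
  import java.util.List;

  public class PruneSketch {

      /** Mirrors Nutch's URLFilter contract: return the (possibly
       *  rewritten) URL to keep it, or null to reject it. */
      interface UrlFilter {
          String filter(String url);
      }

      /** Collect the URLs that the filter chain now rejects. */
      static List<String> urlsToDelete(Iterable<String> allUrls,
                                       List<UrlFilter> chain) {
          List<String> doomed = new ArrayList<String>();
          for (String url : allUrls) {
              String result = url;
              for (UrlFilter f : chain) {
                  result = f.filter(result);
                  if (result == null) break;  // any filter can veto
              }
              if (result == null) doomed.add(url);
          }
          return doomed;
      }
  }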

(It won't work for 0.8. Hopefully it won't be necessary there, though.)

--Matt

On Mar 17, 2006, at 1:44 PM, Insurance Squared Inc. wrote:

> We've got a site that is causing our crawl to slow dramatically,
> from 20 Mbit/s down to about 3 or 4.  The basic problem is that the
> site seems to consist of huge numbers of pages that aren't
> responding.  We can remove the site from the index, but we haven't
> found a way to remove this site permanently from the webdb so that
> we never fetch it again.  Is there an easy way in 0.7.1 to remove a
> site from the webdb, and then keep it permanently removed?

--
Matt Kangas / kangas@gmail.com