Posted to user@nutch.apache.org by og...@yahoo.com on 2007/04/05 08:47:06 UTC

Removing pages from index immediately

Hi,

I'd like to be able to immediately remove certain pages from Nutch (index, crawldb, linkdb...).
The scenario is that I'm using Nutch to index a single site or a set of internal sites.  Once in a while the editors of a site remove a page.  When that happens, I want to update at least the index, and ideally the crawldb and linkdb as well, so that people searching the index don't see the missing page in their results, follow the link, and hit a 404.

I don't think there is a "direct" way to do this with Nutch, is there?
If there really is no direct way, I was thinking I'd just put the URL of the recently removed page into the next fetchlist and then somehow get Nutch to remove that page/URL immediately once it hits a 404.  How does that sound?

Is there a way to configure Nutch to delete a page after it gets even a single 404 for it?  I thought I saw a setting for that somewhere a few weeks ago, but now I can't find it.

Thanks,
Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share



Re: Removing pages from index immediately

Posted by Enis Soztutar <en...@gmail.com>.
Since Hadoop's map files are write-once, it is not possible to delete
individual URLs from the crawldb and linkdb. The only thing you can do is
recreate the map files without the deleted URLs (a rough sketch of that
is below). But running the crawl once more, as you suggested, seems more
appropriate. Deleting documents from the index is plain Lucene work.
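
For illustration, here is a minimal sketch of rewriting a single map file
without the deleted keys, using only the generic Hadoop MapFile API. The
input/output paths and the idea of passing the removed URLs on the command
line are assumptions for the example, not anything Nutch ships with; a
crawldb consists of several such part directories, and you would have to
filter each one and then swap the filtered copy into place.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;

// Copies one MapFile, skipping the keys (URLs) that should disappear.
public class FilterMapFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    String in = args[0];   // an existing part directory, e.g. crawldb/current/part-00000 (assumed layout)
    String out = args[1];  // where the filtered copy goes
    Set<String> removed = new HashSet<String>(Arrays.asList(args).subList(2, args.length));

    MapFile.Reader reader = new MapFile.Reader(fs, in, conf);
    WritableComparable key = (WritableComparable)
        ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable)
        ReflectionUtils.newInstance(reader.getValueClass(), conf);

    MapFile.Writer writer = new MapFile.Writer(conf, fs, out,
        reader.getKeyClass().asSubclass(WritableComparable.class),
        reader.getValueClass());

    // The reader returns keys in sorted order, so the filtered copy
    // is itself a valid MapFile.
    while (reader.next(key, value)) {
      if (!removed.contains(key.toString())) {
        writer.append(key, value);
      }
    }
    writer.close();
    reader.close();
  }
}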

In your case it seems that every once in a while you crawl the whole
site, create the indexes and dbs, and then throw the old ones out.
Between two crawls you can delete the URLs from the index.
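
To delete a page from the index itself, something along these lines should
work against the Lucene index that Nutch writes. It is only a sketch: the
index directory path is an assumption, and it does a brute-force scan over
the stored "url" field so it does not depend on how that field was
tokenized.

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

// Marks every document whose stored "url" field matches the removed page as deleted.
public class DeletePageFromIndex {
  public static void main(String[] args) throws Exception {
    String indexDir = args[0];    // e.g. crawl/index (assumed location of the merged index)
    String removedUrl = args[1];  // exact URL of the page that was taken down

    IndexReader reader = IndexReader.open(indexDir);
    int deleted = 0;
    // Brute-force scan: compare the stored "url" field of every live document.
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) continue;
      Document doc = reader.document(i);
      if (removedUrl.equals(doc.get("url"))) {
        reader.deleteDocument(i);   // marks the document as deleted
        deleted++;
      }
    }
    reader.close();                 // flushes the deletions to disk
    System.out.println("Deleted " + deleted + " document(s) for " + removedUrl);
  }
}

Running searchers will only see the change after they reopen the index.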
