Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2016/07/13 22:20:46 UTC

RE: Nutch db_gone

Hello Mark - why? Although this is possible, I don't see the reason for it; it makes no sense to me. Gone records are not reindexed: they are ignored or, with the correct flags, even removed from the index.

In any case, in Nutch 1.x the CrawlDB is read (optionally in trunk, I believe), and the number of 404s in the segment is passed as well. With some clever key/value passing in IndexerMapReduce, it is straightforward to get that value beforehand.
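
As a rough illustration of the threshold check itself, here is a minimal sketch in plain Java. It does not use any Nutch API: the class GoneThreshold and the methods countByStatus and shouldIndex are hypothetical names, and the status strings are assumed to have been collected beforehand from the CrawlDB records, however you choose to pass them along.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch: counts records per status name
// and decides whether indexing should go ahead.
public class GoneThreshold {

    // Tally how many records carry each status name (e.g. "db_gone").
    static Map<String, Long> countByStatus(Iterable<String> statuses) {
        Map<String, Long> counts = new HashMap<>();
        for (String status : statuses) {
            counts.merge(status, 1L, Long::sum);
        }
        return counts;
    }

    // Index only while the number of gone records does not exceed maxGone.
    static boolean shouldIndex(Map<String, Long> counts, long maxGone) {
        long gone = counts.getOrDefault("db_gone", 0L);
        return gone <= maxGone;
    }

    public static void main(String[] args) {
        // Assumed sample input; real values would come from the CrawlDB.
        List<String> statuses =
            Arrays.asList("db_fetched", "db_gone", "db_unfetched", "db_gone");
        Map<String, Long> counts = countByStatus(statuses);
        System.out.println("gone=" + counts.getOrDefault("db_gone", 0L)
            + " index=" + shouldIndex(counts, 1));
    }
}
```

With the sample input above, two records are gone, so a threshold of 1 would skip indexing while a threshold of 2 would allow it.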

M.

-----Original message-----
> From:mark mark <ma...@gmail.com>
> Sent: Thursday 23rd June 2016 19:52
> To: user@nutch.apache.org
> Subject: Nutch db_gone
> 
> Hi,
> 
> I am using Nutch 1.x. In code (a plugin) I need a way to get the total number
> of db_gone documents.
> 
> We want to set a threshold on db_gone documents: before indexing we want
> to check the number of gone documents, and if it is more than our threshold
> we don't want to index.
> 
> We want to do this from code.
> 
> Thanks Mark
>