Posted to user@nutch.apache.org by ia...@thomson.com on 2006/10/14 03:38:45 UTC

Dedup undeletes previously deleted documents

Hi all,

  I've noticed that in 0.8.1 org.apache.nutch.indexer.DeleteDuplicates
undeletes previously deleted documents in my index.  Here's how you can
reproduce this behavior:

 

- Crawl some documents.
- Open the index with Luke and delete some of the documents.
- Make a temp dir and move output/index into it.
- Run nutch dedup on your temp dir.
- Move temp/index back to output/index.
- Open the index with Luke again, and you'll see that the previously deleted
  documents are alive and well.
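
For anyone who wants to script this, here is a rough sketch of the steps
above as a shell script. The seed directory name (urls), the crawl depth,
and the run/pause/repro helper names are assumptions for illustration and
may not match your setup; the two Luke steps stay manual. By default the
script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only. Prints the commands by default; set DRY_RUN=0 to run them
# from the top of a Nutch 0.8.1 install.
# Assumptions (not in the original report): seed dir "urls", "-depth 1",
# and the helper names run/pause/repro.
DRY_RUN="${DRY_RUN:-1}"

# Echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Wait for the manual Luke step when actually running; no-op in dry-run.
pause() {
  if [ "$DRY_RUN" != "1" ]; then
    printf '%s (press Enter when done) ' "$1"
    read -r _reply
  fi
}

repro() {
  run bin/nutch crawl urls -dir output -depth 1
  pause "Open output/index in Luke and delete some documents"
  run mkdir temp
  run mv output/index temp/
  run bin/nutch dedup temp
  run mv temp/index output/index
  pause "Reopen output/index in Luke and look for the deleted documents"
}

repro
```

Running it with DRY_RUN=0 executes the commands and stops at each Luke
step until you press Enter.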

 

  Is this intentional?

 

Thanks,

Ian


Re: Dedup undeletes previously deleted documents

Posted by Andrzej Bialecki <ab...@getopt.org>.
ian.mcnaney@thomson.com wrote:
> Hi all,
>
>   I've noticed that in 0.8.1 org.apache.nutch.indexer.DeleteDuplicates
> undeletes previously deleted documents in my index.  Here's how you can
>   

[..]

>   Is this intentional?
>   

Good question ... ;) I don't see any reason to do this in the current 
code base.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com