Posted to user@nutch.apache.org by ia...@thomson.com on 2006/10/14 03:38:45 UTC
Dedup undeletes previously deleted documents
Hi all,
I've noticed that in 0.8.1 org.apache.nutch.indexer.DeleteDuplicates
undeletes previously deleted documents in my index. Here's how you can
reproduce this behavior:
-Crawl some documents
-Open the index with Luke and delete some of the documents
-Make a temp dir and move output/index into it
-Run Nutch dedup on your temp dir
-Move temp/index back to output/index
-Open the index with Luke again and you'll see that the previously deleted
documents are alive and well.
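For anyone who wants to script the middle steps (move the index into a temp
dir, dedup, move it back), here is a rough Python sketch. The function name
dedup_via_temp_dir, the "bin/nutch" path and the injectable run parameter are
my own assumptions for illustration, not anything shipped with Nutch. The
deleting and checking in Luke still have to be done by hand.

```python
import shutil
import subprocess
from pathlib import Path


def dedup_via_temp_dir(index_dir, temp_dir,
                       dedup_cmd=("bin/nutch", "dedup"),
                       run=subprocess.run):
    """Move index_dir into temp_dir, run dedup on temp_dir, move the index back.

    dedup_cmd is an assumption: it presumes the script runs from the Nutch
    install directory. `run` is injectable so the moves can be exercised
    without a Nutch installation.
    """
    index_dir = Path(index_dir)
    temp_dir = Path(temp_dir)
    temp_dir.mkdir(parents=True, exist_ok=True)

    # Step: move output/index into the temp dir.
    moved = temp_dir / index_dir.name
    shutil.move(str(index_dir), str(moved))
    try:
        # Step: run dedup against the temp dir (not against the index itself).
        run([*dedup_cmd, str(temp_dir)], check=True)
    finally:
        # Step: move temp/index back, even if dedup failed.
        shutil.move(str(moved), str(index_dir))
    return index_dir
```

Calling dedup_via_temp_dir("output/index", "temp") after deleting documents in
Luke should leave the index back in place, ready to be reopened in Luke to
check whether the deletions survived.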
Is this intentional?
Thanks,
Ian
Re: Dedup undeletes previously deleted documents
Posted by Andrzej Bialecki <ab...@getopt.org>.
ian.mcnaney@thomson.com wrote:
> Hi all,
>
> I've noticed that in 0.8.1 org.apache.nutch.indexer.DeleteDuplicates
> undeletes previously deleted documents in my index. Here's how you can
>
[..]
> Is this intentional?
>
Good question ... ;) I don't see any reason to do this in the current
code base.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com