
Posted to java-user@lucene.apache.org by gekkokid <me...@gekkokid.org.uk> on 2006/01/28 18:00:09 UTC

deleting duplicate documents from my index

Hi, I'm trying to delete duplicate documents from my index. The unique identifier is the document's URL (the "url" field).

My initial thought on how to accomplish this is to open the index with a reader, sort the documents by URL, and then iterate through them, comparing each document with the previous one. If the URLs match, I would delete the current document, and so on.
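A rough sketch of the sort-and-compare idea above, in plain Java with no Lucene calls. UrlDoc and SortedDedup are hypothetical names made up for illustration; a real version would read the doc numbers and "url" values from the index and pass the resulting ids to the reader's delete call.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an indexed document: a doc number plus its "url" field.
final class UrlDoc {
    final int docId;
    final String url;

    UrlDoc(int docId, String url) {
        this.docId = docId;
        this.url = url;
    }
}

final class SortedDedup {
    // Returns the doc ids to delete: every document whose url equals the url
    // of the document just before it in url order.
    static List<Integer> findDuplicates(List<UrlDoc> docs) {
        List<UrlDoc> sorted = new ArrayList<>(docs);
        // Sort by url; break ties by doc id so the lowest doc id is the one kept.
        sorted.sort(Comparator.comparing((UrlDoc d) -> d.url)
                              .thenComparingInt(d -> d.docId));

        List<Integer> toDelete = new ArrayList<>();
        String previousUrl = null;
        for (UrlDoc d : sorted) {
            if (d.url.equals(previousUrl)) {
                toDelete.add(d.docId);   // same url as the previous doc: a duplicate
            } else {
                previousUrl = d.url;     // first doc seen for this url: keep it
            }
        }
        return toDelete;
    }
}
```

The sort costs O(n log n) and needs every url in memory at once, which is the "taxing" part for a large index.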

What other methods that are not too taxing could I try?

How could I sort the documents by URL internally? What classes should I be looking at to do this?


Thanks,
_gk

Re: deleting duplicate documents from my index

Posted by gekkokid <me...@gekkokid.org.uk>.

Hi, that's exactly what I did :) It works perfectly.

Thanks

_gk
----- Original Message ----- 
From: "Chris Hostetter" <ho...@fucit.org>
To: <ja...@lucene.apache.org>
Sent: Monday, January 30, 2006 5:56 AM
Subject: Re: deleting duplicate documents from my index


>
> : Hi, I'm trying to delete duplicate documents from my index. The unique
> : identifier is the document's URL (the "url" field).
> :
> : My initial thought on how to accomplish this is to open the index with a
> : reader, sort the documents by URL, and then iterate through them,
> : comparing each document with the previous one. If the URLs match, I would
> : delete the current document, and so on.
>
> Assuming your "url" field is a keyword field (indexed, not tokenized), then
> take a look at IndexReader.terms(Term), which returns a TermEnum. If you
> start with new Term("url","") and iterate until the field is no longer
> "url", you'll be iterating over every url Term in your index. For each one,
> check docFreq, and if it's more than 1 you've got a dup.
>
> Then look at IndexReader.termDocs for an easy way to find out which docs
> share that url.
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> 



Re: deleting duplicate documents from my index

Posted by Chris Hostetter <ho...@fucit.org>.

: Hi, I'm trying to delete duplicate documents from my index. The unique
: identifier is the document's URL (the "url" field).
:
: My initial thought on how to accomplish this is to open the index with a
: reader, sort the documents by URL, and then iterate through them,
: comparing each document with the previous one. If the URLs match, I would
: delete the current document, and so on.

Assuming your "url" field is a keyword field (indexed, not tokenized), then
take a look at IndexReader.terms(Term), which returns a TermEnum. If you
start with new Term("url","") and iterate until the field is no longer
"url", you'll be iterating over every url Term in your index. For each one,
check docFreq, and if it's more than 1 you've got a dup.

Then look at IndexReader.termDocs for an easy way to find out which docs
share that url.
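
A minimal sketch of the term walk described above, in plain Java with no Lucene calls. The TreeMap is only an assumed stand-in for the index's sorted term dictionary (one entry per distinct url, mapped to the docs containing it), and TermWalkDedup is a made-up name. In real code the outer loop would be the TermEnum iteration and the inner list would come from termDocs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class TermWalkDedup {
    // Builds a toy postings map: url term -> doc ids containing it, in doc id order.
    // The array index plays the role of the doc number.
    static SortedMap<String, List<Integer>> buildPostings(String[] urlByDocId) {
        SortedMap<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < urlByDocId.length; docId++) {
            postings.computeIfAbsent(urlByDocId[docId], k -> new ArrayList<>()).add(docId);
        }
        return postings;
    }

    // Walks every url term in sorted order; for terms with docFreq > 1,
    // keeps the first doc and marks the rest for deletion.
    static List<Integer> duplicatesToDelete(SortedMap<String, List<Integer>> postings) {
        List<Integer> toDelete = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> term : postings.entrySet()) {
            List<Integer> docs = term.getValue();
            int docFreq = docs.size();           // how many docs share this url
            if (docFreq > 1) {
                toDelete.addAll(docs.subList(1, docFreq));
            }
        }
        return toDelete;
    }
}
```

Because the terms are already stored in sorted order, no separate sort of the documents is needed, and urls that occur only once are skipped after a single docFreq check.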



-Hoss



Re: deleting duplicate documents from my index

Posted by Jeff Rodenburg <je...@gmail.com>.

One way to do this (depending on your system and index size) is to remove
and add every url you find.  This would ensure that every document in the
index is unique.  No need to worry about sorting and iteration and doc_ids
and the like.

It rebuilds your entire index, but if you have a duplication issue that
needs to be addressed, it's worth it.
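
A toy illustration of the remove-then-add idea, in plain Java with no Lucene calls. ToyIndex and its methods are hypothetical names invented for this sketch; in a real index the delete step would be a delete by the "url" term followed by adding the fresh document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for an index; not a Lucene class.
final class ToyIndex {
    private final List<String[]> docs = new ArrayList<>(); // each entry is {url, body}

    void add(String url, String body) {
        docs.add(new String[] {url, body});
    }

    // Removes every document with the given url and returns how many were removed.
    int deleteByUrl(String url) {
        int before = docs.size();
        docs.removeIf(d -> d[0].equals(url));
        return before - docs.size();
    }

    // Remove-then-add: after this call exactly one document carries the url.
    void replace(String url, String body) {
        deleteByUrl(url);
        add(url, body);
    }

    int countFor(String url) {
        int n = 0;
        for (String[] d : docs) {
            if (d[0].equals(url)) {
                n++;
            }
        }
        return n;
    }

    int size() {
        return docs.size();
    }
}
```

Calling replace once per url leaves one document per url no matter how many copies existed before, at the cost of re-adding every document.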

Hope this helps.

-- j
