You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by João Rodrigues <an...@gmail.com> on 2008/04/29 19:48:45 UTC

Removing duplicate entries

Hello all. Before I ask my question, I'd like to clarify I've read the
manual and searched the archives, and if I'm here, it is because I've
neither found a suitable answer, or (most likely) I didn't understand those
which I did find :)

I have an index built, which I update regularly. However, there may be some
entries, at the time of the update, which are already in the index. As such,
I'd like to remove such entries. I've read about some IndexReader.termEnum
stuff, but I can't make anything of it =\ I can (and will) rebuild my index,
so I'm looking at a "runtime" approach.

Here's what I'd like to do: I'm indexing the new documents to a RAM Index.
Then, I merge it with the FS one, already on the hard drive and containing
all the old documents. Is there a way to check if a given document is in the
index, and if it is, replace it with this new one? I know (or at least think
I know) how to do this "replacement", but it's the comparison I'm failing to
grasp.


Thanks for any answers :)

-- 
João Rodrigues

Re: Removing duplicate entries

Posted by João Rodrigues <an...@gmail.com>.
>Probably something very like that, although you see none of that. Just
>doing a deleteDocument(term) does it all for you. And I learned long ago
>that the folks who write this kind of stuff can probably do it more
>efficiently
>than I can <G>.

And probably more efficiently that I can as well :) Thanks for the tip.

>I ask because a lot of people are mistakenly suppose that RAMdirs are
faster
>because they have RAM in front <G>. The indexing process uses RAM
implicitly
>and periodically flushes to FS. You can control this by various parameters
>on
>IndexWriter like setMaxMergeDocs, setMergeFactor etc. and I personally
>prefer letting that do the work and avoiding having to code merging the
two.

>But your point about failure is valid, although you might want to search
the
>mail archive for discussions on this as there's been work done to make this
>less of a concern....

Yep, I'm (was, thanks to you :P) one of those persons. I'd already read
that, but since I had the code written, I was thinking of letting go :) But
then again, since I'm getting a major memory leak, I believe I'll try that
option!

Thanks for the help!
Best Regards,

-- 
João Rodrigues

Re: Removing duplicate entries

Posted by Erick Erickson <er...@gmail.com>.
See below:

On Tue, Apr 29, 2008 at 9:51 PM, João Rodrigues <an...@gmail.com> wrote:

> First of all, let me apologize for the double post but I got some strange
> error message =\
>
> >The first question is what do you mean the document
> >is already in the index? Lucene doc IDs are useless
> >here since the ones in your FSDir and the ones in your
> >RAMdir are unrelated. In fact, I suspect that the
> >lucene docIDs will start at the same number in both.
>
> The documents have 2 fields, one of them being their identifier, which is
> Indexed and Not Tokenized (and stored).
>

Cool, this will work just fine for you. You can forget everything
I said about Lucene doc IDs.

>
> >Lucene doc IDs are just monotonically incremented integers.
> >So, how do you identify identical documents? Is there some
> >field in your document that's guaranteed to be unique to each
> >document? If so, *that's* the field you can use for termEnu
> >to get the Lucene docid to remove, assuming you've indexed
> >it UN_TOKENIZED or you are very, very, very confident that
> >your tokenizers won't break it up.
> >But you can make this easier by using IndexReader.deleteDocument(term)
> >where the term is your unique field.
>
> Since I have an unique field for each document, I'll use that then.
> However,
> what's the thinking behind the enum/delete? The termEnum lists me the
> terms
> which have a particular field value and gives me their "lucene ID" so that
> the Reader can then delete it?
>

Probably something very like that, although you see none of that. Just
doing a deleteDocument(term) does it all for you. And I learned long ago
that the folks who write this kind of stuff can probably do it more
efficiently
than I can <G>.


>
> >Additionally, I question why you bother with a RAMdir for your changes.
> >An index reader essentially takes a snapshot of your index, and
> >subsequent changes are not seen by your searchers until you
> >close and reopen the underlying reader. What advantage do you
> >see in using a RAMdir?
>
> I use a RAMdir because it is somehow faster. I'm downloading a few
> thousand
> documents at a time from the internet and then indexing them in a RAM
> index
> and then merging them to the FS dir. Also, in case anything fails, my FS
> directory, and thus my "stable" index, is safe from harm :) But why did
> you
> ask? Advice is welcome :)
>

I ask because a lot of people are mistakenly suppose that RAMdirs are faster
because they have RAM in front <G>. The indexing process uses RAM implicitly
and periodically flushes to FS. You can control this by various parameters
on
IndexWriter like setMaxMergeDocs, setMergeFactor etc. and I personally
prefer letting that do the work and avoiding having to code merging the two.

But your point about failure is valid, although you might want to search the
mail archive for discussions on this as there's been work done to make this
less of a concern....

Best
Erick


>
> Thanks,
>
> João Rodrigues
>

Re: Removing duplicate entries

Posted by João Rodrigues <an...@gmail.com>.
First of all, let me apologize for the double post but I got some strange
error message =\

>The first question is what do you mean the document
>is already in the index? Lucene doc IDs are useless
>here since the ones in your FSDir and the ones in your
>RAMdir are unrelated. In fact, I suspect that the
>lucene docIDs will start at the same number in both.

The documents have 2 fields, one of them being their identifier, which is
Indexed and Not Tokenized (and stored).

>Lucene doc IDs are just monotonically incremented integers.
>So, how do you identify identical documents? Is there some
>field in your document that's guaranteed to be unique to each
>document? If so, *that's* the field you can use for termEnu
>to get the Lucene docid to remove, assuming you've indexed
>it UN_TOKENIZED or you are very, very, very confident that
>your tokenizers won't break it up.
>But you can make this easier by using IndexReader.deleteDocument(term)
>where the term is your unique field.

Since I have an unique field for each document, I'll use that then. However,
what's the thinking behind the enum/delete? The termEnum lists me the terms
which have a particular field value and gives me their "lucene ID" so that
the Reader can then delete it?

>Additionally, I question why you bother with a RAMdir for your changes.
>An index reader essentially takes a snapshot of your index, and
>subsequent changes are not seen by your searchers until you
>close and reopen the underlying reader. What advantage do you
>see in using a RAMdir?

I use a RAMdir because it is somehow faster. I'm downloading a few thousand
documents at a time from the internet and then indexing them in a RAM index
and then merging them to the FS dir. Also, in case anything fails, my FS
directory, and thus my "stable" index, is safe from harm :) But why did you
ask? Advice is welcome :)

Thanks,

João Rodrigues

Re: Removing duplicate entries

Posted by Erick Erickson <er...@gmail.com>.
The first question is what do you mean the document
is already in the index? Lucene doc IDs are useless
here since the ones in your FSDir and the ones in your
RAMdir are unrelated. In fact, I suspect that the
lucene docIDs will start at the same number in both.

Lucene doc IDs are just monotonically incremented integers.

So, how do you identify identical documents? Is there some
field in your document that's guaranteed to be unique to each
document? If so, *that's* the field you can use for termEnum
to get the Lucene docid to remove, assuming you've indexed
it UN_TOKENIZED or you are very, very, very confident that
your tokenizers won't break it up.

But you can make this easier by using IndexReader.deleteDocument(term)
where the term is your unique field.

Additionally, I question why you bother with a RAMdir for your changes.
An index reader essentially takes a snapshot of your index, and
subsequent changes are not seen by your searchers until you
close and reopen the underlying reader. What advantage do you
see in using a RAMdir?

Best
Erick

On Tue, Apr 29, 2008 at 1:57 PM, João Rodrigues <an...@gmail.com> wrote:

> Hello all. Before I ask my question, I'd like to clarify I've read the
> manual and searched the archives, and if I'm here, it is because I've
> neither found a suitable answer, or (most likely) I didn't understand
> those
> which I did find :)
>
> I have an index built, which I update regularly. However, there may be
> some
> entries, at the time of the update, which are already in the index. As
> such,
> I'd like to remove such entries. I've read about some IndexReader.termEnum
> stuff, but I can't make anything of it =\ I can (and will) rebuild my
> index,
> so I'm looking at a "runtime" approach.
>
> Here's what I'd like to do: I'm indexing the new documents to a RAM Index.
> Then, I merge it with the FS one, already on the hard drive and containing
> all the old documents. Is there a way to check if a given document is in
> the
> index, and if it is, replace it with this new one? I know (or at least
> think
> I know) how to do this "replacement", but it's the comparison I'm failing
> to
> grasp.
>
>
> Thanks for any answers :)
>
> --
> João Rodrigues
>

Fwd: Removing duplicate entries

Posted by João Rodrigues <an...@gmail.com>.
Hello all. Before I ask my question, I'd like to clarify I've read the
manual and searched the archives, and if I'm here, it is because I've
neither found a suitable answer, or (most likely) I didn't understand those
which I did find :)

I have an index built, which I update regularly. However, there may be some
entries, at the time of the update, which are already in the index. As such,
I'd like to remove such entries. I've read about some IndexReader.termEnum
stuff, but I can't make anything of it =\ I can (and will) rebuild my index,
so I'm looking at a "runtime" approach.

Here's what I'd like to do: I'm indexing the new documents to a RAM Index.
Then, I merge it with the FS one, already on the hard drive and containing
all the old documents. Is there a way to check if a given document is in the
index, and if it is, replace it with this new one? I know (or at least think
I know) how to do this "replacement", but it's the comparison I'm failing to
grasp.


Thanks for any answers :)

-- 
João Rodrigues