Posted to java-user@lucene.apache.org by Dave Kor <s0...@sms.ed.ac.uk> on 2005/06/12 16:37:50 UTC

Ideas Needed - Finding Duplicate Documents

Hi,

I would like to poll the community's opinion on good strategies for identifying
duplicate documents in a lucene index.

You see, I have an index containing roughly 25 million lucene documents. My task
requires me to work at sentence level so each lucene document actually contains
exactly one sentence. The issue I have right now is that sometimes, certain
sentences are duplicated and I'd like to be able to identify them as a BitSet
so that I can filter away these duplicates in my search.

Obviously the brute force method of pairwise compares would take forever. I have
tried grouping sentences using their hashCode() and then doing a pairwise compare
between sentences that have the same hashCode, but even with a 1GB heap I ran
out of memory after comparing 200k sentences.

Any other ideas?


Regards
Dave Kor.


Re: Ideas Needed - Finding Duplicate Documents

Posted by Chris Lamprecht <cl...@gmail.com>.
I'd have to see your indexing code to see if there are any obvious
performance gotchas there.  If you can run your indexer under a
profiler (OptimizeIt, JProbe, or just the free one with java using
-Xprof), it will tell you in which methods most of your CPU time is
spent.  If you're using StandardAnalyzer, then this may be it --
StandardAnalyzer is a fairly advanced grammar-based parser, but it is
pretty slow.  If you don't need its functionality, then try using a
simpler Analyzer (like WhitespaceAnalyzer or a subclass).
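
Something like this is all it takes to swap analyzers (an untested
sketch -- the index path and the "sentence" field name are made up):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SimpleIndexer {
    public static void main(String[] args) throws Exception {
        // true = create a new index rather than appending to an existing one
        IndexWriter writer =
            new IndexWriter("/tmp/sentence-index", new WhitespaceAnalyzer(), true);
        Document doc = new Document();
        // Field.Text is stored, indexed, and tokenized (here by whitespace only)
        doc.add(Field.Text("sentence", "the quick brown fox"));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }
}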

As far as changing a document within an index -- there is no "update"
operation for documents, there's just delete and add (and then
optimize).  Delete only marks docs as deleted (so they don't come back
in search results); they aren't physically removed from the index
files until you optimize.
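
The delete-then-add dance looks roughly like this (an untested sketch
against the 1.4-style API; the "id" field and its value are invented):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateDocument {
    public static void main(String[] args) throws Exception {
        String index = "/tmp/sentence-index";

        // 1. Mark the old version deleted through an IndexReader.
        IndexReader reader = IndexReader.open(index);
        reader.delete(new Term("id", "doc-42"));
        reader.close();

        // 2. Add the replacement with an IndexWriter (false = append).
        IndexWriter writer =
            new IndexWriter(index, new WhitespaceAnalyzer(), false);
        Document doc = new Document();
        doc.add(Field.Keyword("id", "doc-42"));   // untokenized unique key
        doc.add(Field.Text("sentence", "the replacement sentence"));
        writer.addDocument(doc);

        // 3. Optimize to physically remove the deleted doc from the files.
        writer.optimize();
        writer.close();
    }
}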

Also, it isn't fatal that your current index doesn't have MD5 info in
it.  It's pretty fast to compute MD5 at search time for each document
returned (much faster than the I/O-bound part -- actually retrieving
the docs from the Lucene index).  So you could try just doing all your
duplicate detection at search time.  If this is too slow, you could
consider caching the computed MD5 for your docs.
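
For example, something along these lines (untested; it assumes the
sentence is available as a stored "sentence" field, which may not match
your schema):

import java.security.MessageDigest;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.search.Hits;

public class SearchTimeDedup {

    // Walk the hits, hash each stored sentence, and keep only the
    // first occurrence of each hash.
    public static BitSet firstOccurrences(Hits hits) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        Set seen = new HashSet();                 // hex digests seen so far
        BitSet keep = new BitSet(hits.length());
        for (int i = 0; i < hits.length(); i++) {
            String sentence = hits.doc(i).get("sentence");
            String digest = toHex(md.digest(sentence.getBytes("UTF-8")));
            if (seen.add(digest)) {               // add() is false for dups
                keep.set(i);
            }
        }
        return keep;
    }

    private static String toHex(byte[] bytes) {
        StringBuffer sb = new StringBuffer(bytes.length * 2);
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i] & 0xff;
            if (b < 0x10) sb.append('0');         // keep two digits per byte
            sb.append(Integer.toHexString(b));
        }
        return sb.toString();
    }
}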

-chris

On 6/12/05, Dave Kor <s0...@sms.ed.ac.uk> wrote:
> Thanks for the quick reply, Chris.
> 
> Yes, when I say "duplicate" sentences, they are exact copies of the same string.
> 
> The MD5 hash is a good idea; I wish I had thought of it earlier, as it would have
> saved me a lot of trouble. Right now it is not feasible to reindex because
> indexing is a very slow and CPU-intensive task for me. I'm adding
> part-of-speech, chunk, named-entity and coreference information as I index,
> which means it takes 4 separate servers and 4-5 days of processing to create a
> new index. And as far as I know, you can't change the index once it's created.
> Am I correct?
> 
> Any other ideas that don't require me to re-index the whole thing?
> 
>


Re: Ideas Needed - Finding Duplicate Documents

Posted by Chris Hostetter <ho...@fucit.org>.
: Yes, when I say "duplicate" sentences, they are exact copies of the same
: string.

you still haven't explained how you indexed these sentences -- what do you
mean by "each lucene document actually contains exactly one sentence" ?

Did you tokenize the sentence into one field? do you have a field for verbs
and a field for nouns?  what does the structure of your documents look like?

if (per chance) you have one field in each document that contains the
original, untokenized sentence as an indexed keyword, then finding
duplicates would be pretty damn easy by iterating over a TermEnum on that
field and looking for any term in more than one document.
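
here's a rough, untested sketch, assuming that keyword field is called
"sentence":

import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class FindDuplicates {

    // Walk every term of the untokenized "sentence" field; any term
    // with docFreq > 1 is a duplicated sentence.  Keep the first doc
    // for each term and flag the rest in a BitSet of duplicates.
    public static BitSet duplicateDocs(IndexReader reader) throws Exception {
        BitSet dups = new BitSet(reader.maxDoc());
        TermEnum terms = reader.terms(new Term("sentence", ""));
        try {
            while (terms.term() != null
                    && terms.term().field().equals("sentence")) {
                if (terms.docFreq() > 1) {
                    TermDocs docs = reader.termDocs(terms.term());
                    docs.next();               // skip the first occurrence
                    while (docs.next()) {
                        dups.set(docs.doc());  // flag the extra copies
                    }
                    docs.close();
                }
                if (!terms.next()) break;
            }
        } finally {
            terms.close();
        }
        return dups;
    }
}

that BitSet could then be handed to a Filter to drop the duplicates at
search time.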

admittedly, that's a pretty contrived case, and most likely that isn't the
situation you are in -- but it serves as an example of how understanding
your index structure can help people answer your question.

can you send the code you used to index these documents?  that
might help people spot novel ways of finding likely duplicates.


-Hoss



Re: Ideas Needed - Finding Duplicate Documents

Posted by Dave Kor <s0...@sms.ed.ac.uk>.
Thanks for the quick reply, Chris.

Yes, when I say "duplicate" sentences, they are exact copies of the same string.

The MD5 hash is a good idea; I wish I had thought of it earlier, as it would have
saved me a lot of trouble. Right now it is not feasible to reindex because
indexing is a very slow and CPU-intensive task for me. I'm adding
part-of-speech, chunk, named-entity and coreference information as I index,
which means it takes 4 separate servers and 4-5 days of processing to create a
new index. And as far as I know, you can't change the index once it's created.
Am I correct?

Any other ideas that don't require me to re-index the whole thing?


Quoting Chris Lamprecht <cl...@gmail.com>:

> Dave,
>
> Can you define exactly what you consider "duplicate sentences"?  Is it
> the same exact string, or the same words in the same order, or the
> same words in any order, etc?
>
> If you can normalize each sentence first, so two "duplicate" sentences
> are always the exact same string, then you should be able to fit a
> hash of each sentence in RAM.  By hash, I mean a
> cryptographic-strength hash, such as MD5.  The big difference between
> an MD5 hash and a hashCode() hash is you won't get collisions with the
> MD5 hash.  (More accurately, the probability of two different objects
> having the same MD5 hash is astronomically low).  Each MD5 takes 16
> bytes, so 16 * 25M sentences = 400MB of memory.  You can probably get
> away with a smaller hash in your case, maybe half of an MD5 (64 bits),
> saving you RAM.  You could also store this MD5 in your lucene index as
> a keyword field (after converting the 16-byte MD5 value into a 32-byte
> hex string, for example), providing a fast way to find duplicates at
> search time.
>
> If you can give more details on your requirements, people on this list
> can probably come up with some pretty good solutions.
>
> -chris
>
> On 6/12/05, Dave Kor <s0...@sms.ed.ac.uk> wrote:
> > Hi,
> >
> > I would like to poll the community's opinion on good strategies for
> > identifying duplicate documents in a lucene index.
> >
> > You see, I have an index containing roughly 25 million lucene documents.
> > My task requires me to work at sentence level so each lucene document
> > actually contains exactly one sentence. The issue I have right now is
> > that sometimes, certain sentences are duplicated and I'd like to be able
> > to identify them as a BitSet so that I can filter away these duplicates
> > in my search.
> >
> > Obviously the brute force method of pairwise compares would take forever.
> > I have tried grouping sentences using their hashCode() and then doing a
> > pairwise compare between sentences that have the same hashCode, but even
> > with a 1GB heap I ran out of memory after comparing 200k sentences.
> >
> > Any other ideas?
> >
> >
> > Regards
> > Dave Kor.


Re: Ideas Needed - Finding Duplicate Documents

Posted by Chris Lamprecht <cl...@gmail.com>.
Dave,

Can you define exactly what you consider "duplicate sentences"?  Is it
the same exact string, or the same words in the same order, or the
same words in any order, etc?

If you can normalize each sentence first, so two "duplicate" sentences
are always the exact same string, then you should be able to fit a
hash of each sentence in RAM.  By hash, I mean a
cryptographic-strength hash, such as MD5.  The big difference between
an MD5 hash and a hashCode() hash is you won't get collisions with the
MD5 hash.  (More accurately, the probability of two different objects
having the same MD5 hash is astronomically low).  Each MD5 takes 16
bytes, so 16 * 25M sentences = 400MB of memory.  You can probably get
away with a smaller hash in your case, maybe half of an MD5 (64 bits),
saving you RAM.  You could also store this MD5 in your lucene index as
a keyword field (after converting the 16-byte MD5 value into a 32-byte
hex string, for example), providing a fast way to find duplicates at
search time.
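
For example (an untested sketch; the "md5" field name is invented, and
the normalization step is whatever your pipeline produces):

import java.security.MessageDigest;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class Md5Field {

    // Add the MD5 hex digest of the normalized sentence as an
    // untokenized keyword field, so exact duplicates share one term.
    static void addMd5(Document doc, String normalizedSentence)
            throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(normalizedSentence.getBytes("UTF-8"));
        StringBuffer hex = new StringBuffer(32);
        for (int i = 0; i < digest.length; i++) {
            int b = digest[i] & 0xff;
            if (b < 0x10) hex.append('0');    // keep two digits per byte
            hex.append(Integer.toHexString(b));
        }
        // Field.Keyword = stored, indexed, NOT tokenized
        doc.add(Field.Keyword("md5", hex.toString()));
    }
}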

If you can give more details on your requirements, people on this list
can probably come up with some pretty good solutions.

-chris

On 6/12/05, Dave Kor <s0...@sms.ed.ac.uk> wrote:
> Hi,
> 
> I would like to poll the community's opinion on good strategies for identifying
> duplicate documents in a lucene index.
> 
> You see, I have an index containing roughly 25 million lucene documents. My task
> requires me to work at sentence level so each lucene document actually contains
> exactly one sentence. The issue I have right now is that sometimes, certain
> sentences are duplicated and I'd like to be able to identify them as a BitSet
> so that I can filter away these duplicates in my search.
> 
> Obviously the brute force method of pairwise compares would take forever. I have
> tried grouping sentences using their hashCode() and then doing a pairwise compare
> between sentences that have the same hashCode, but even with a 1GB heap I ran
> out of memory after comparing 200k sentences.
> 
> Any other ideas?
> 
> 
> Regards
> Dave Kor.


Re: Ideas Needed - Finding Duplicate Documents

Posted by Paul Libbrecht <pa...@activemath.org>.
Have you tried comparing TermVectors?
I would expect them, or an adjustment of them, to allow the comparison to
focus on "important terms" (e.g. about 100-200 terms) and then allow
a more reasonable computation.
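
A very rough, untested sketch of what I mean; it assumes the "sentence"
field was indexed with term vectors enabled, which may not be true of
an existing index:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermVectorOverlap {

    // Crude overlap score between the term vectors of two documents;
    // a score near 1.0 suggests likely duplicates.
    public static double overlap(IndexReader reader, int docA, int docB)
            throws Exception {
        TermFreqVector a = reader.getTermFreqVector(docA, "sentence");
        TermFreqVector b = reader.getTermFreqVector(docB, "sentence");
        if (a == null || b == null) return 0.0;   // no vector stored
        String[] ta = a.getTerms();               // sorted ascending
        String[] tb = b.getTerms();
        int i = 0, j = 0, common = 0;
        while (i < ta.length && j < tb.length) {  // merge-style intersection
            int cmp = ta[i].compareTo(tb[j]);
            if (cmp == 0) { common++; i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return (double) common / Math.max(ta.length, tb.length);
    }
}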

paul


On 12 Jun 05, at 16:37, Dave Kor wrote:

> Hi,
>
> I would like to poll the community's opinion on good strategies for 
> identifying
> duplicate documents in a lucene index.
>
> You see, I have an index containing roughly 25 million lucene 
> documents. My task
> requires me to work at sentence level so each lucene document actually 
> contains
> exactly one sentence. The issue I have right now is that sometimes, 
> certain
> sentences are duplicated and I'd like to be able to identify them as 
> a BitSet
> so that I can filter away these duplicates in my search.
>
> Obviously the brute force method of pairwise compares would take 
> forever. I have
> tried grouping sentences using their hashCode() and then doing a 
> pairwise compare
> between sentences that have the same hashCode, but even with a 1GB heap 
> I ran
> out of memory after comparing 200k sentences.
>
> Any other ideas?
>
>
> Regards
> Dave Kor.


Re: Ideas Needed - Finding Duplicate Documents

Posted by Karsten Konrad <ka...@dacos.com>.
Hi David,

>>
I would like to poll the community's opinion on good strategies for identifying
duplicate documents in a lucene index.
>>

Do you mean 100% duplicates or some kind of similarity?

>>
Obviously the brute force method of pairwise compares would take forever. I have tried
grouping sentences using their hashCode() and then doing a pairwise compare between
sentences that have the same hashCode, but even with a 1GB heap I ran out of memory
after comparing 200k sentences.
>>

If you are only after 100% duplicates, you are on the right track with
hash codes.

You could encode the hash code of the strings into the index by adding it to a
separate field - your analyzer must index numbers for this! Then, iterate over all
terms of that field, retrieving the document enumerator for each; wherever you find
more than one document, do the pairwise comparison as usual. This way, you should
never need to compare more than a few documents.
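
Roughly like this (untested; it assumes the hash went into a
keyword-style field called "hash" and that the sentence text is stored
in a "sentence" field):

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class HashBuckets {

    // For every hash term shared by two or more documents, fetch those
    // few documents and compare the stored sentences pairwise.
    public static void findDuplicates(IndexReader reader) throws Exception {
        TermEnum terms = reader.terms(new Term("hash", ""));
        while (terms.term() != null && terms.term().field().equals("hash")) {
            if (terms.docFreq() > 1) {            // a collision bucket
                List sentences = new ArrayList();
                TermDocs docs = reader.termDocs(terms.term());
                while (docs.next()) {
                    sentences.add(reader.document(docs.doc()).get("sentence"));
                }
                docs.close();
                // buckets are tiny, so pairwise comparison stays cheap
                for (int i = 0; i < sentences.size(); i++)
                    for (int j = i + 1; j < sentences.size(); j++)
                        if (sentences.get(i).equals(sentences.get(j)))
                            System.out.println("duplicate: " + sentences.get(i));
            }
            if (!terms.next()) break;
        }
        terms.close();
    }
}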

All the best,

Karsten

--

Dr.-Ing. Karsten Konrad

Research & Development
DACOS Software GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
http://www.dacos.com

Tel: ++49/ (0) 681 - 302 64834
Fax: ++49/ (0) 681 - 302 64827



-----Original Message-----
From: Dave Kor [mailto:s0454888@sms.ed.ac.uk]
Sent: Sunday, June 12, 2005 16:38
To: java-user@lucene.apache.org
Subject: Ideas Needed - Finding Duplicate Documents

Hi,

I would like to poll the community's opinion on good strategies for identifying
duplicate documents in a lucene index.

You see, I have an index containing roughly 25 million lucene documents. My task
requires me to work at sentence level so each lucene document actually contains exactly
one sentence. The issue I have right now is that sometimes, certain sentences are
duplicated and I'd like to be able to identify them as a BitSet so that I can filter
away these duplicates in my search.

Obviously the brute force method of pairwise compares would take forever. I have tried
grouping sentences using their hashCode() and then doing a pairwise compare between
sentences that have the same hashCode, but even with a 1GB heap I ran out of memory
after comparing 200k sentences.

Any other ideas?


Regards
Dave Kor.
