Posted to solr-dev@lucene.apache.org by Steven Rowe <sa...@syr.edu> on 2007/09/25 17:50:10 UTC

Near duplicate detection [was: Re: Implication of not calling closeSearcher() in DirectUpdateHandler2?]

Hi,

Cuong Hoang wrote:
> BTW, has anyone here done any serious near duplication detection with Solr?
> If yes, what approaches did you use?
[...]
> Unfortunately some of our documents are "near duplications" which means they
> are mostly identical (>75%) but usually not 100% identical. hashCode is very
> sensitive to small changes so it can't be used in our case. 

You may be interested in this Lucene java-user ML thread:

<http://www.gossamer-threads.com/lists/lucene/java-user/41103>

The Nutch TextProfileSignature implementation[1] mentioned in the
above-linked thread appears to take an MD5 signature of the
frequency-ordered downcased whitespace-separated tokens from a document.
This approach is not quite as sensitive to small changes as a direct
hash of the content, but it will likely fail fairly often if you're
looking at differences of more than a few percent (as your ">75%
identical" seems to indicate).
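
To make the idea concrete, here is a minimal Python sketch of a signature
built only from the description above: downcase, split on whitespace,
order the distinct tokens by frequency, and take an MD5 of the result. It
is not the Nutch code. The function name profile_signature, the
alphabetical tie-break, and the choice to hash the ordered tokens without
their counts are all my assumptions.

```python
import hashlib
from collections import Counter


def profile_signature(text: str) -> str:
    """MD5 over the distinct tokens of `text`, ordered by frequency.

    Sketch only: follows the prose description, not the Nutch source.
    """
    # Downcase and split on any run of whitespace.
    tokens = text.lower().split()
    counts = Counter(tokens)
    # Most frequent first; ties broken alphabetically (an assumption)
    # so that the ordering is deterministic.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    profile = "\n".join(ordered)
    return hashlib.md5(profile.encode("utf-8")).hexdigest()
```

With this sketch, documents that differ only in case, whitespace, or word
order get the same signature, but adding or removing even a single
distinct token changes it, which is why it would struggle with documents
that are only about 75% identical.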

I have done some small-scale deduplication work (without Solr), and
found that a small preprocessing step using regular expressions to
remove changeable content that was not meaningful for the purposes of
comparison (e.g. hit counters and date/time stamps) was fairly
successful in reducing the error rate for a brute-force term frequency
comparison approach (i.e., direct calculation of the angle between doc
pairs' term vectors).
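
As a rough illustration of that two-step approach, here is a short Python
sketch: strip changeable content with regular expressions, then compute
the angle between the term-frequency vectors of two documents. The
patterns in VOLATILE and all function names are hypothetical examples,
not the ones I actually used; real patterns depend on the corpus.

```python
import math
import re
from collections import Counter

# Hypothetical patterns for content that changes without changing meaning.
VOLATILE = [
    re.compile(r"\b\d{4}[-/]\d{2}[-/]\d{2}\b"),        # dates like 2007/09/25
    re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b"),       # times like 17:50:10
    re.compile(r"\b\d+\s+(?:hits|views|visitors)\b", re.IGNORECASE),
]


def strip_volatile(text: str) -> str:
    """Replace every match of a volatile pattern with a space."""
    for pattern in VOLATILE:
        text = pattern.sub(" ", text)
    return text


def term_vector(text: str) -> Counter:
    """Term frequencies of the downcased, whitespace-split, cleaned text."""
    return Counter(strip_volatile(text).lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term vectors (0.0 if either is empty)."""
    dot = sum(n * b[tok] for tok, n in a.items())  # Counter gives 0 if absent
    norm = (math.sqrt(sum(n * n for n in a.values()))
            * math.sqrt(sum(n * n for n in b.values())))
    return dot / norm if norm else 0.0


def angle_degrees(a: Counter, b: Counter) -> float:
    """Angle between two term vectors, clamped against rounding error."""
    cos = min(1.0, max(-1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cos))
```

A pair of documents would then be flagged as near duplicates when the
angle falls below some threshold chosen by inspecting known duplicates.
Comparing every pair is quadratic in the number of documents, so this
only suits small collections.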

Steve

[1] API doc for Nutch TextProfileSignature class:
<http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/TextProfileSignature.html>