Posted to solr-dev@lucene.apache.org by Thomas Heigl <th...@systemone.at> on 2010/03/24 10:28:15 UTC

Implementing near duplicate detection algorithm using IDF statistics

Hello,

For my current project I need to implement an index-time mechanism to
detect (near) duplicate documents. The TextProfileSignature available
out-of-the-box (http://wiki.apache.org/solr/Deduplication) seems adequate,
but it does not use global collection statistics when deciding which terms
to use for calculating the signature.
Most state-of-the-art hash-based duplicate detection algorithms make
use of this information to improve precision and recall (e.g.
http://portal.acm.org/citation.cfm?id=506311&dl=GUIDE&coll=GUIDE&CFID=83187370&CFTOKEN=47052122).

Is it possible to access collection statistics - especially IDF values
for all non-discarded terms in the current document - from within an
implementation of the Signature class?

Kind regards,

Thomas

--
DDI Thomas Heigl
Software Engineer
--------------------------------------------
System One
Gesellschaft für technologiegestützte
Kommunikationsprozesse m.b.H.
Stiftgasse 6/2/6
thomas.heigl@systemone.at
http://www.systemone.at
Powered by Open-Xchange.com

Re: Implementing near duplicate detection algorithm using IDF statistics

Posted by Chris Hostetter <ho...@fucit.org>.

: Is it possible to access collection statistics - especially IDF values
: for all non-discarded terms in the current document - from within an
: implementation of the Signature class?

The Signature API just lets you compute a unique value from a pile of
Strings, but you could extend the SignatureUpdateProcessorFactory so that it
only gives the Signature class specific field values, chosen by IDF. The IDF
values are available to the SignatureUpdateProcessorFactory through the
IndexReader, which is reachable via the SolrCore, which is in turn reachable
via the SolrQueryRequest.

The complication you will run into with an approach like this is that the
UpdateProcessor pipeline happens before Analysis (it has to, since it might
be adding or removing fields from the documents). The String values haven't
been tokenized yet, so you can't easily look up the IDF of the terms in
the doc. You'd have to do your own preliminary Analysis of the raw
field values.
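
A minimal sketch of that idea, assuming the per-term document frequencies
and the total document count have already been read from the index. The
class name, method names, the regex tokenizer standing in for real Analysis,
the smoothed IDF formula, and the "keep the N rarest terms" cutoff are all
hypothetical choices for illustration. None of them are Solr or Lucene APIs,
and this is not the algorithm from the paper linked above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper (not part of Solr): IDF-filtered signature over raw field text. */
final class IdfSignatureSketch {

    /** Smoothed inverse document frequency; one common variant, chosen here as an assumption. */
    static double idf(int docFreq, int numDocs) {
        return Math.log((numDocs + 1.0) / (docFreq + 1.0)) + 1.0;
    }

    /** Stand-in for preliminary Analysis: lowercase, split on non-alphanumerics, dedupe. */
    static TreeSet<String> tokenize(String rawFieldValue) {
        TreeSet<String> terms = new TreeSet<>();
        for (String t : rawFieldValue.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Keeps the maxTerms terms with the highest IDF, returned in alphabetical order. */
    static List<String> selectTerms(String rawFieldValue, Map<String, Integer> docFreqs,
                                    int numDocs, int maxTerms) {
        List<String> terms = new ArrayList<>(tokenize(rawFieldValue));
        // Rarest terms first; ties are broken alphabetically so the result is deterministic.
        // Terms missing from docFreqs are treated as docFreq 0, i.e. maximally rare.
        terms.sort(Comparator
                .comparingDouble((String t) -> -idf(docFreqs.getOrDefault(t, 0), numDocs))
                .thenComparing(Comparator.<String>naturalOrder()));
        List<String> kept = new ArrayList<>(terms.subList(0, Math.min(maxTerms, terms.size())));
        Collections.sort(kept);
        return kept;
    }

    /** MD5 hex digest over the selected terms, each followed by a zero byte as separator. */
    static String signature(String rawFieldValue, Map<String, Integer> docFreqs,
                            int numDocs, int maxTerms) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String term : selectTerms(rawFieldValue, docFreqs, numDocs, maxTerms)) {
                md5.update(term.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

With a cutoff like this, two documents that differ only in common words end
up with the same selected terms and therefore the same signature. One
caveat of treating unseen terms as maximally rare is that typos and other
one-off tokens are favored, so a real implementation would probably also
want a lower bound on document frequency.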


-Hoss

Re: Implementing near duplicate detection algorithm using IDF statistics

Posted by Ted Dunning <te...@gmail.com>.

For reference, you can get a rental copy of this article for less than the
cost of the full PDF download here:


http://www.deepdyve.com/lp/association-for-computing-machinery/collection-statistics-for-fast-duplicate-document-detection-0o7i3Sx0Wd

(joining the ACM is also a good thing to do)

(and yes, this is licensed by the ACM)
