Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2008/10/04 01:02:59 UTC

[Solr Wiki] Update of "Deduplication" by Mark Miller

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by Mark Miller:
http://wiki.apache.org/solr/Deduplication

New page:
= Document Duplication Detection =
<!> ["Planning"]

[[TableOfContents]]

= Overview =

A low-collision or fuzzy hash algorithm can efficiently prevent duplicate or near-duplicate documents from entering an index, or tag documents with a signature/fingerprint for duplicate field collapsing. Solr should natively support deduplication techniques of this type and allow custom implementations for generating the hash/signature.

== Goals ==
 * Efficient, hash-based detection and blocking of exact and near-duplicate documents.
 

= Design Overview =

'''Signature'''

A class capable of generating a signature String from the concatenation of a group of specified document fields.

Implementations:
 * MD5Signature - Used for exact duplicate detection.
 * TextProfileSignature - Fuzzy hashing implementation from Nutch, used for near-duplicate detection.
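
As a rough illustration of the exact-match case, the sketch below concatenates the values of the chosen fields and hashes the result with MD5. It is not the actual MD5Signature class: the class name `SignatureSketch`, the helper `md5Signature`, and the use of a plain `Map` as the document are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; not Solr's Signature API.
class SignatureSketch {

    /** Returns the MD5 hex digest of the concatenated values of the named fields. */
    static String md5Signature(Map<String, String> doc, List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            String value = doc.get(field);
            if (value != null) {          // skip fields the document does not have
                sb.append(value);
            }
        }
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(sb.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> fields = List.of("field1", "field2");
        Map<String, String> first = Map.of("id", "1", "field1", "a", "field2", "bc");
        Map<String, String> second = Map.of("id", "2", "field1", "a", "field2", "bc");
        // Same field content gives the same signature, even though the ids differ.
        System.out.println(md5Signature(first, fields).equals(md5Signature(second, fields)));
    }
}
```

Because MD5 changes completely on any difference in the input, a signature of this kind only catches exact duplicates of the selected fields; near-duplicate detection needs a fuzzy hash such as the one TextProfileSignature provides.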

The DeduplicateUpdateProcessorFactory must be registered in solrconfig.xml as part of an update request processor chain:

Accepting all defaults:
{{{
  <updateRequestProcessorChain name="dedupe">
    <processor
      class="org.apache.solr.update.processor.DeduplicateUpdateProcessorFactory">
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
}}}

Example settings:
{{{
  <updateRequestProcessorChain name="dedupe">
    <processor
      class="org.apache.solr.update.processor.DeduplicateUpdateProcessorFactory">
      <bool name="blockDupes">true</bool>
      <str name="fields">field1,field2</str>
      <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
      <str name="signatureField">signatureField</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
}}}


= Settings =
|| '''Setting''' || '''Default''' || '''Description''' ||
||blockDupes || true || Blocks documents with matching signatures from entering the index. This setting does not honor the allowDupes setting (which blocks dupes based on the unique field); instead, duplicates are allowed or blocked based only on the signature field.||
||signatureClass || org.apache.solr.update.processor.MD5Signature || A Signature implementation for generating a signature hash.||
||fields || all fields || A comma-separated list of the fields used to generate the signature hash. By default, all non-null fields on the document are used.||
||signatureField || signatureField || The name of the field used to hold the fingerprint/signature. Be sure the field is defined in schema.xml.||
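
The way these settings interact can be sketched in a few lines of Java. This is a simplified model under stated assumptions, not the processor's implementation: `DedupeSettingsSketch` is a hypothetical class, documents are plain maps, previously seen signatures live in an in-memory set rather than in the index, the "all fields" default is taken in sorted field-name order, and the signature function is passed in by the caller.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical model of the settings table; not Solr code.
class DedupeSettingsSketch {
    private final boolean blockDupes;
    private final List<String> fields;                 // null or empty means "all fields"
    private final Function<String, String> signature;  // stand-in for a Signature implementation
    private final Set<String> seen = new HashSet<>();  // stand-in for signatures already indexed

    DedupeSettingsSketch(boolean blockDupes, List<String> fields,
                         Function<String, String> signature) {
        this.blockDupes = blockDupes;
        this.fields = fields;
        this.signature = signature;
    }

    /** Returns true if the document is accepted, false if it is blocked as a duplicate. */
    boolean add(Map<String, String> doc) {
        // Default: every field on the document, in sorted name order (an assumption).
        List<String> names = (fields == null || fields.isEmpty())
                ? new ArrayList<>(new TreeSet<>(doc.keySet()))
                : fields;
        StringBuilder sb = new StringBuilder();
        for (String name : names) {
            String value = doc.get(name);
            if (value != null) {                       // only non-null fields contribute
                sb.append(value);
            }
        }
        boolean isNew = seen.add(signature.apply(sb.toString()));
        // With blockDupes off, duplicates are still accepted (they are merely signed).
        return isNew || !blockDupes;
    }

    public static void main(String[] args) {
        // String.hashCode() is only a placeholder hash for the demo.
        Function<String, String> hash = s -> Integer.toHexString(s.hashCode());
        DedupeSettingsSketch dedupe =
                new DedupeSettingsSketch(true, List.of("field1", "field2"), hash);
        System.out.println(dedupe.add(Map.of("id", "1", "field1", "a", "field2", "b")));
        System.out.println(dedupe.add(Map.of("id", "2", "field1", "a", "field2", "b")));
    }
}
```

In this sketch, leaving `fields` unset means the unique id also feeds the signature, so two documents that differ only in id are not treated as duplicates; listing the content fields explicitly avoids that.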


----