Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2008/10/04 05:37:45 UTC

[Solr Wiki] Trivial Update of "Deduplication" by Mark Miller

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by Mark Miller:
http://wiki.apache.org/solr/Deduplication

------------------------------------------------------------------------------
  
  = Overview =
  
- Preventing duplicate or near duplicate documents from entering an index or tagging documents with a signature/fingerprint for duplicate field collapsing can be efficiently achieved with a low collision or fuzzy hash algorithm. Solr should natively support deduplication techniques of this type and allow for custom implementations for generating the hash/signature.
+ Preventing duplicate or near duplicate documents from entering an index or tagging documents with a signature/fingerprint for duplicate field collapsing can be efficiently achieved with a low collision or fuzzy hash algorithm. Solr should natively support deduplication techniques of this type and allow for custom hash/signature implementations to be plugged in.
  
  == Goals ==
   * Efficient, hash based exact/near document duplication detection and blocking.
@@ -57, +57 @@

      <processor
        class="org.apache.solr.update.processor.DeduplicateUpdateProcessorFactory">
  
+         <bool name="enabled">true</bool>
          <bool name="blockDupes">true</bool>
          <str name="fields">field1,field2</str>
          <str name="signatureClass">
@@ -77, +78 @@

  ||signatureClass || org.apache.solr.update.processor.MD5Signature || A Signature implementation for generating a signature hash. ||
  ||fields || all fields || The fields to use to generate the signature hash in a comma separated list. By default, all non null fields on the document will be used. ||
  ||signatureField || signatureField || The name of the field used to hold the fingerprint/signature. Be sure the field is defined in schema.xml. ||
- 
+ ||enabled || true || Enable/disable dedupe factory processing ||
  
  ----
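
For reference, the parameters in the table above could be combined as in the following solrconfig.xml sketch. It is assembled only from the names and default values listed in the table. The surrounding `updateRequestProcessorChain` element, the chain name `dedupe`, and the trailing Log/Run processors are assumptions added for illustration and are not part of this change.

```xml
<!-- Sketch only. The chain wrapper, the name "dedupe", and the
     Log/Run processors are illustrative assumptions. -->
<updateRequestProcessorChain name="dedupe">
  <processor
    class="org.apache.solr.update.processor.DeduplicateUpdateProcessorFactory">

    <!-- Enable/disable dedupe factory processing (default per the table: true) -->
    <bool name="enabled">true</bool>
    <bool name="blockDupes">true</bool>
    <!-- Comma separated list of fields used to generate the signature hash -->
    <str name="fields">field1,field2</str>
    <!-- Signature implementation; the table lists MD5Signature as the default -->
    <str name="signatureClass">org.apache.solr.update.processor.MD5Signature</str>
    <!-- Field that holds the signature; it must be defined in schema.xml -->
    <str name="signatureField">signatureField</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With `enabled` set to false, the dedupe processor would presumably pass documents through unchanged, so the rest of the chain can stay in place while deduplication is switched off.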