Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2008/10/04 05:37:45 UTC
[Solr Wiki] Trivial Update of "Deduplication" by Mark Miller
The following page has been changed by Mark Miller:
http://wiki.apache.org/solr/Deduplication
------------------------------------------------------------------------------
= Overview =
- Preventing duplicate or near duplicate documents from entering an index or tagging documents with a signature/fingerprint for duplicate field collapsing can be efficiently achieved with a low collision or fuzzy hash algorithm. Solr should natively support deduplication techniques of this type and allow for custom implementations for generating the hash/signature.
+ Preventing duplicate or near duplicate documents from entering an index or tagging documents with a signature/fingerprint for duplicate field collapsing can be efficiently achieved with a low collision or fuzzy hash algorithm. Solr should natively support deduplication techniques of this type and allow for custom hash/signature implementations to be plugged in.
== Goals ==
* Efficient, hash based exact/near document duplication detection and blocking.
@@ -57, +57 @@
<processor
class="org.apache.solr.update.processor.DeduplicateUpdateProcessorFactory">
+ <bool name="enabled">true</bool>
<bool name="blockDupes">true</bool>
<str name="fields">field1,field2</str>
<str name="signatureClass">
@@ -77, +78 @@
||signatureClass || org.apache.solr.update.processor.MD5Signature || A Signature implementation for generating a signature hash. ||
||fields || all fields || The fields to use to generate the signature hash in a comma separated list. By default, all non null fields on the document will be used. ||
||signatureField || signatureField || The name of the field used to hold the fingerprint/signature. Be sure the field is defined in schema.xml. ||
-
+ ||enabled || true || Enable/disable dedupe factory processing ||
----
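To make the parameters in the table above concrete, here is a minimal sketch in Python of how a signature could be computed over selected fields and used to block duplicates. It is not Solr's implementation: `TinyIndex`, `md5_signature`, and `add_document` are hypothetical names invented for illustration, and the exact behaviour of `blockDupes` (rejecting a document whose signature is already present) is an assumption based on the "detection and blocking" goal stated on the page.

```python
import hashlib


class TinyIndex:
    """Hypothetical in-memory stand-in for an index (not a Solr class)."""

    def __init__(self):
        self.docs = []
        self.signatures = set()


def md5_signature(doc, fields=None):
    """Return an MD5 hex digest over the chosen fields of a dict document.

    With fields=None, every non-null field is used (sorted by name so the
    result does not depend on dict ordering), mirroring the "all fields"
    default described in the table.
    """
    if fields is None:
        names = sorted(name for name, value in doc.items() if value is not None)
    else:
        names = [name for name in fields if doc.get(name) is not None]
    digest = hashlib.md5()
    for name in names:
        # Separate names and values with a NUL byte so that adjacent
        # fields cannot run together and produce the same input bytes.
        digest.update(name.encode("utf-8") + b"\0")
        digest.update(str(doc[name]).encode("utf-8") + b"\0")
    return digest.hexdigest()


def add_document(index, doc, fields=None, signature_field="signatureField",
                 block_dupes=True, enabled=True):
    """Add doc to index; return False if it was blocked as a duplicate.

    Assumption: block_dupes=True rejects a document whose signature
    already exists; block_dupes=False only tags the document.
    """
    if not enabled:
        index.docs.append(dict(doc))
        return True
    signature = md5_signature(doc, fields)
    if block_dupes and signature in index.signatures:
        return False
    stored = dict(doc)
    stored[signature_field] = signature  # tag the stored copy with its fingerprint
    index.docs.append(stored)
    index.signatures.add(signature)
    return True
```

With `fields=["field1", "field2"]`, two documents that differ only in some other field (for example an id) produce the same signature, so the second one is rejected when `block_dupes` is true. An exact hash like MD5 only catches documents whose selected fields are identical; near-duplicate detection would need a fuzzy signature in its place.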