Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2008/10/04 01:02:59 UTC
[Solr Wiki] Update of "Deduplication" by Mark Miller
The following page has been changed by Mark Miller:
http://wiki.apache.org/solr/Deduplication
New page:
= Document Duplication Detection =
<!> ["Planning"]
[[TableOfContents]]
= Overview =
Preventing duplicate or near-duplicate documents from entering an index, or tagging documents with a signature/fingerprint for duplicate field collapsing, can be achieved efficiently with a low-collision or fuzzy hash algorithm. Solr should natively support deduplication techniques of this type and allow custom implementations for generating the hash/signature.
== Goals ==
* Efficient, hash-based detection and blocking of exact and near-duplicate documents.
= Design Overview =
'''Signature''': A class that generates a signature String from the concatenation of a specified group of document fields.

Implementations:
* MD5Signature: used for exact duplicate detection.
* TextProfileSignature: fuzzy hashing implementation from Nutch, used for near-duplicate detection.
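
To make the Signature idea concrete, here is a minimal sketch of how a signature could be derived from concatenated field values using an MD5 digest. The class and method names (SignatureSketch, Md5SignatureSketch, signatureOf, calculate) are hypothetical and chosen for illustration only; they are not the actual Solr classes or their API.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

/** Hypothetical stand-in for the Signature abstraction; not Solr's actual API. */
abstract class SignatureSketch {
    /** Returns a signature string for the already-concatenated content. */
    abstract String calculate(String content);

    /** Concatenates the selected field values, then delegates to calculate(). */
    String signatureOf(List<String> fieldValues) {
        StringBuilder sb = new StringBuilder();
        for (String value : fieldValues) {
            if (value != null) {
                sb.append(value); // null fields are skipped
            }
        }
        return calculate(sb.toString());
    }
}

/** Exact-duplicate signature: identical content always yields the same hash. */
class Md5SignatureSketch extends SignatureSketch {
    @Override
    String calculate(String content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(content.getBytes(StandardCharsets.UTF_8));
            // Render the 16-byte digest as 32 zero-padded lowercase hex characters.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Note that in this sketch plain concatenation makes the field values ("a", "bc") and ("abc") hash identically; whether a separator between fields is desirable is a design choice left open here.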
The DeduplicateUpdateProcessorFactory must be registered in solrconfig.xml as part of an update request processor chain:
Accepting all defaults:
{{{
<updateRequestProcessorChain name="dedupe">
  <processor
    class="org.apache.solr.update.processor.DeduplicateUpdateProcessorFactory">
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
}}}
Example settings:
{{{
<updateRequestProcessorChain name="dedupe">
  <processor
    class="org.apache.solr.update.processor.DeduplicateUpdateProcessorFactory">
    <bool name="blockDupes">true</bool>
    <str name="fields">field1,field2</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    <str name="signatureField">signatureField</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
}}}
= Settings =
|| '''Setting''' || '''Default''' || '''Description''' ||
||blockDupes || true || Blocks documents with matching signatures from entering the index. This setting does not honor the allowDupes setting (which blocks dupes based on the unique field); instead, duplicates are allowed or blocked based only on the signature field.||
||signatureClass || org.apache.solr.update.processor.MD5Signature || A Signature implementation for generating a signature hash.||
||fields || all fields || A comma-separated list of the fields used to generate the signature hash. By default, all non-null fields on the document are used.||
||signatureField || signatureField || The name of the field used to hold the fingerprint/signature. Be sure the field is defined in schema.xml.||
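
The effect of blockDupes can be illustrated with a small in-memory sketch. It assumes signatures have already been computed and tracks them in a set; a real update processor would check the index for an existing document with the same signatureField value instead. The class name DedupeFilterSketch and its accept method are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory illustration of blockDupes; not the actual processor. */
class DedupeFilterSketch {
    private final Set<String> seenSignatures = new HashSet<>();
    private final boolean blockDupes;

    DedupeFilterSketch(boolean blockDupes) {
        this.blockDupes = blockDupes;
    }

    /** Returns true if a document with this signature should be indexed. */
    boolean accept(String signature) {
        // Set.add returns false when the signature was already recorded.
        boolean firstTime = seenSignatures.add(signature);
        // With blockDupes disabled, repeated signatures are still let through.
        return firstTime || !blockDupes;
    }
}
```

With blockDupes set to true, the first document carrying a given signature is accepted and later documents with the same signature are rejected; with it set to false, every document is accepted.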
----