Posted to solr-user@lucene.apache.org by Wangsheng Mei <ha...@gmail.com> on 2010/01/29 19:43:45 UTC

Solr duplicates detection!!

Document Duplication Detection

Solr1.4 </solr/Solr1.4>

Contents

   1. Document Duplication Detection <#Document_Duplication_Detection>
   2. Overview <#Overview>
      1. Goals <#Goals>
      2. Design <#Design>
   3. Notes <#Notes>
   4. Configuration <#Configuration>
      1. solrconfig.xml <#solrconfig.xml>
         1. Note <#Note>
      2. Settings <#Settings>

 Overview

Preventing duplicate or near-duplicate documents from entering an index, or
tagging documents with a signature/fingerprint for duplicate field
collapsing, can be achieved efficiently with a low-collision or fuzzy hash
algorithm. Solr should natively support deduplication techniques of this
type and allow for the easy addition of new hash/signature implementations.

Goals

   - Efficient, hash based exact/near document duplication detection and
   blocking.
   - Allow for both duplicate collapsing in search results as well as
   deduplication on adding a document.

 Design

Signature

A class capable of generating a signature String from the concatenation of a
group of specified document fields.

public abstract class Signature {
  public void init(SolrParams nl) {
  }

  public abstract String calculate(String content);
}

Implementations:

MD5Signature

128 bit hash used for exact duplicate detection.
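
As a rough illustration of what an exact-match implementation could look like, the sketch below hashes the content string with MD5 using only the JDK. The names SketchSignature, Md5SketchSignature and Md5SketchDemo are hypothetical stand-ins, not Solr classes, and init(SolrParams) is left out so the example compiles without Solr on the classpath. It is a minimal sketch under those assumptions, not the MD5Signature source.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the abstract Signature class shown above
// (init(SolrParams) is omitted so this compiles without Solr).
abstract class SketchSignature {
    public abstract String calculate(String content);
}

// Assumption: the signature is the hex-encoded MD5 digest of the content.
class Md5SketchSignature extends SketchSignature {
    @Override
    public String calculate(String content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Two lowercase hex digits per byte: 16 bytes give 32 characters.
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Standard JDKs ship MD5; treat its absence as a setup error.
            throw new IllegalStateException(e);
        }
    }
}

class Md5SketchDemo {
    public static void main(String[] args) {
        SketchSignature sig = new Md5SketchSignature();
        // Identical content always yields the same 128-bit signature.
        System.out.println(sig.calculate("hello world"));
    }
}
```

Because the digest depends on every byte of the input, even a one-character change produces a different signature, which is why this style of hash only catches exact duplicates.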

Lookup3Signature </solr/Lookup3Signature>

64-bit hash used for exact duplicate detection; much faster than MD5 and
smaller to index.

TextProfileSignature </solr/TextProfileSignature>

Fuzzy hashing implementation from Nutch for near-duplicate detection. It's
tunable but works best on longer text.
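
To give a feel for how a fuzzy signature can tolerate small differences, here is a simplified sketch of one general approach: build a word-frequency profile, round the counts down so that rare words drop out, and hash the profile. This is not the Nutch or Solr code. The class name FuzzySignatureSketch and the parameters minTokenLen and quantRate are assumptions made for the example.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class FuzzySignatureSketch {
    // Hypothetical parameters: minTokenLen drops short tokens, quantRate
    // controls how coarsely the word counts are rounded.
    static String calculate(String content, int minTokenLen, double quantRate) {
        // Count tokens; TreeMap keeps them alphabetical so ties sort stably.
        Map<String, Integer> counts = new TreeMap<>();
        int maxFreq = 0;
        for (String token : content.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.length() < minTokenLen) {
                continue;
            }
            int c = counts.merge(token, 1, Integer::sum);
            maxFreq = Math.max(maxFreq, c);
        }
        // Round each count down to a multiple of quant; tokens that fall
        // to zero are treated as noise and left out of the profile.
        final int quant = Math.max(1, (int) Math.round(maxFreq * quantRate));
        final Map<String, Integer> quantized = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int q = (e.getValue() / quant) * quant;
            if (q > 0) {
                quantized.put(e.getKey(), q);
            }
        }
        // Order by quantized count, highest first (stable for equal counts).
        List<String> tokens = new ArrayList<>(quantized.keySet());
        tokens.sort((a, b) -> quantized.get(b) - quantized.get(a));
        StringBuilder profile = new StringBuilder();
        for (String t : tokens) {
            profile.append(t).append(' ').append(quantized.get(t)).append('\n');
        }
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(profile.toString().getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String a = "solr solr solr solr search search search search index";
        String b = "solr solr solr solr search search search search engine";
        // The two texts differ only in one rare word, which is quantized away.
        System.out.println(calculate(a, 2, 0.5).equals(calculate(b, 2, 0.5)));
    }
}
```

With short inputs there are few repeated words to anchor the profile, which is consistent with the remark that this kind of signature works best on longer text.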

There are other more sophisticated algorithms for fuzzy/near hashing that
could be added later.

Notes

Adding the dedupe process changes the allowDups setting so that it applies
to an update Term (with field signatureField in this case) rather than the
unique field Term. The signatureField could of course be the unique field,
but generally you want the unique field to be unique.

When a document is added, a signature will automatically be generated and
attached to the document in the specified signatureField.
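
The flow described here can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the document is modelled as a Map, addSignature is a hypothetical helper rather than a Solr API, and String.hashCode() stands in for whatever Signature implementation is configured.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DedupeFlowSketch {
    // Hypothetical helper, not a Solr API: concatenates the values of the
    // configured fields (in the order given) and stores a signature of that
    // text in the document under signatureField.
    static void addSignature(Map<String, String> doc, List<String> fields,
                             String signatureField) {
        StringBuilder content = new StringBuilder();
        for (String field : fields) {
            String value = doc.get(field);
            if (value != null) {
                content.append(value);
            }
        }
        // Placeholder hash for illustration only; Solr would call the
        // configured Signature implementation here instead.
        String signature = Integer.toHexString(content.toString().hashCode());
        doc.put(signatureField, signature);
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("name", "Widget");
        doc.put("features", "blue, small");
        doc.put("cat", "tools");
        doc.put("price", "9.99"); // not listed in fields, so it is ignored
        addSignature(doc, Arrays.asList("name", "features", "cat"), "id");
        System.out.println(doc.get("id"));
    }
}
```

Two documents that agree on every listed field receive the same signature even when other fields differ, which is what lets the index treat them as duplicates.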

Configuration

solrconfig.xml

The SignatureUpdateProcessorFactory </solr/SignatureUpdateProcessorFactory>
has to be registered in solrconfig.xml as part of the
UpdateRequest </solr/UpdateRequest> chain:

Accepting all defaults:

  <updateRequestProcessorChain name="dedupe">
    <processor
      class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

Example settings:

  <!-- An example dedup update processor that creates the "id" field on the fly
       based on the hash code of some other fields.  This example has
       overwriteDupes set to false since we are using the id field as the
       signatureField and Solr will maintain uniqueness based on that anyway. -->
  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">false</bool>
      <str name="signatureField">id</str>
      <str name="fields">name,features,cat</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

 Note

Also be sure to change your update handlers to use the defined chain, i.e.

  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" >
    <lst name="defaults">
      <str name="update.processor">dedupe</str>
    </lst>
  </requestHandler>

The update processor chain can also be specified per request with the
parameter update.processor=dedupe.

Settings

Setting         Default                                            Description
--------------  -------------------------------------------------  ------------------------------------------
signatureClass  org.apache.solr.update.processor.Lookup3Signature  A Signature implementation for generating
                                                                   a signature hash.
fields          all fields                                         The fields to use to generate the signature
                                                                   hash, as a comma-separated list. By default,
                                                                   all fields on the document will be used.
signatureField  signatureField                                     The name of the field used to hold the
                                                                   fingerprint/signature. Be sure the field is
                                                                   defined in schema.xml.
enabled         true                                               Enable/disable dedupe factory processing.


-- 
梅旺生

Re: Solr duplicates detection!!

Posted by Wangsheng Mei <ha...@gmail.com>.
Sorry for sending the wrong message; this should have gone to my own mailbox :(



-- 
梅旺生