Posted to solr-user@lucene.apache.org by eks dev <ek...@googlemail.com> on 2011/03/25 13:50:51 UTC

Deduplication questions

Q1. Is it possible to pass *analyzed* content to the

public abstract class Signature {
  public void init(SolrParams nl) {  }
  public abstract String calculate(String content);
}


Q2. The calculate() method uses the concatenated fields from
<str name="fields">name,features,cat</str>
Is there any mechanism by which I could build "field-dependent signatures"?

Use case for this: I have two fields:
OWNER, TEXT
I need to prevent *fuzzy* duplicates per owner. One clean way
would be to build a prefixed signature, "OWNER/FUZZY_SIGNATURE"

Is the idea of making two UpdateProcessors and chaining them OK? (It is
ugly, but would work)

  <updateRequestProcessorChain name="signature_hard">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">false</bool>
      <str name="signatureField">exact_signature</str>
      <str name="fields">OWNER</str>
      <str name="signatureClass">ExactSignature</str>
    </processor>
  </updateRequestProcessorChain>

hard_signature should be a field that is neither stored nor indexed

  <updateRequestProcessorChain name="fuzzy_and_mix">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">true</bool>
      <str name="signatureField">mixed_signature</str>
      <str name="fields">exact_signature, TEXT</str>
      <str name="signatureClass">MixedSignature</str>
    </processor>
  </updateRequestProcessorChain>

  <field name="hard_signature" type="string" stored="false" indexed="false" multiValued="false" />
  <field name="mixed_signature" type="string" stored="true" indexed="true" multiValued="false" />

Assuming I know how long my exact_signature is, I could calculate
the fuzzy part and mix the two properly.
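
A minimal sketch of the "OWNER/FUZZY_SIGNATURE" idea in plain Java, outside of Solr. Everything here is an assumption made for illustration: the class name OwnerPrefixedSignature, the two-argument calculate(), and the toy fuzzy hash (which only ignores case, punctuation, token order, and repeated tokens) are hypothetical and are not part of the Solr Signature API. Because the MD5 hex prefix always has 32 characters, the exact part has a known length and can be split off from the fuzzy part.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.TreeSet;

/** Hypothetical helper, not a Solr class: builds "OWNER/FUZZY" style signatures. */
public class OwnerPrefixedSignature {

    /** Hex-encoded MD5 of the input; always 32 characters long. */
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Toy fuzzy hash (assumption): hashes the sorted set of lowercase tokens. */
    static String fuzzy(String text) {
        TreeSet<String> tokens = new TreeSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return md5Hex(String.join(" ", tokens));
    }

    /** Exact owner part first, then "/" and the fuzzy text part. */
    public static String calculate(String owner, String text) {
        return md5Hex(owner) + "/" + fuzzy(text);
    }
}
```

With this shape, two near-identical texts collide only when the owner matches exactly, since a different owner changes the prefix.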

Are there any better ideas?

Thanks,
eks

Re: Deduplication questions

Posted by Chris Hostetter <ho...@fucit.org>.
: Q1. Is it possible to pass *analyzed* content to the
: 
: public abstract class Signature {

No, analysis happens as the documents are being written to the Lucene 
index, well after the UpdateProcessors have had a chance to interact with 
the values.
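
Since calculate() only ever sees the raw field values, one workaround is to normalize the content inside the signature code itself. The sketch below is only an illustration under assumptions: NormalizingSignature and normalize() are hypothetical names, the class is a standalone stand-in that mirrors the calculate(String) shape quoted above rather than extending the real Solr class, and the normalization is a crude substitute for a real analyzer.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

/** Hypothetical stand-in for a Signature implementation; not part of Solr. */
public class NormalizingSignature {

    /** Crude substitute for analysis: lowercase, then collapse every
     *  run of non-letter, non-digit characters into a single space. */
    static String normalize(String content) {
        return content.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}]+", " ")
                .trim();
    }

    /** Mirrors calculate(String): raw content in, hex MD5 of the normalized text out. */
    public static String calculate(String content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(normalize(content).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

The trade-off is that the normalization rules live in the signature class and can drift from the analyzer configured for the field in the schema.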

: Q2. The calculate() method uses the concatenated fields from
: <str name="fields">name,features,cat</str>
: Is there any mechanism by which I could build "field-dependent signatures"?

At the moment the Signature API is fairly minimal, but it could definitely 
be improved by adding more methods (that have sensible defaults in the 
base class) that would give the impl more control over the resulting 
signature ... we just need people to propose good suggestions with example 
use cases.

: Is the idea of making two UpdateProcessors and chaining them OK? (It is
: ugly, but would work)

I don't know whether what you describe is really intentional, but it 
should work.


-Hoss