Posted to solr-user@lucene.apache.org by eks dev <ek...@googlemail.com> on 2011/03/25 13:50:51 UTC
Deduplication questions
Q1. Is it possible to pass *analyzed* content to the
public abstract class Signature {
public void init(SolrParams nl) { }
public abstract String calculate(String content);
}
Q2. The calculate() method uses the concatenated fields from
<str name="fields">name,features,cat</str>
Is there any mechanism I could use to build "field-dependent signatures"?
Use case for this: I have two fields:
OWNER, TEXT
I need to eliminate *fuzzy* duplicates within a single owner; one clean
way would be to make a prefixed signature "OWNER/FUZZY_SIGNATURE".
Is the idea of making two UpdateProcessors and chaining them OK? (It is
ugly, but it would work.)
<updateRequestProcessorChain name="signature_hard">
<processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<bool name="overwriteDupes">false</bool>
<str name="signatureField">exact_signature</str>
<str name="fields">OWNER</str>
<str name="signatureClass">ExactSignature</str>
</processor>
</updateRequestProcessorChain>
hard_signature should be a field that is neither stored nor indexed
<updateRequestProcessorChain name="fuzzy_and_mix">
<processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<bool name="overwriteDupes">true</bool>
<str name="signatureField">mixed_signature</str>
<str name="fields">exact_signature, TEXT</str>
<str name="signatureClass">MixedSignature</str>
</processor>
</updateRequestProcessorChain>
<field name="hard_signature" type="string" stored="false"
indexed="false" multiValued="false" />
<field name="mixed_signature" type="string" stored="true"
indexed="true" multiValued="false" />
Assuming I know how long my exact_signature is, I could calculate the
fuzzy part and mix it in properly.
Is this possible? Any better ideas?
Thanks,
eks
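
The mixing step described in the question above can be sketched roughly as follows. This is only an illustration under several assumptions that the thread does not confirm: that the exact signature is a 32-character hex MD5 string, that it arrives at the very start of the concatenated content with TEXT directly after it, and that a crude "sorted unique tokens" hash is an acceptable stand-in for a real fuzzy signature. The class name MixedSignatureSketch and its helper methods are hypothetical, and the class deliberately does not extend Solr's Signature so that it runs without Solr on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical, dependency-free sketch; not Solr's actual Signature API.
final class MixedSignatureSketch {

    // Assumed length of the exact signature: a hex-encoded MD5 digest.
    static final int EXACT_LEN = 32;

    // Hex-encoded MD5 of a string (stands in for the "exact" signature).
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Stand-in fuzzy signature: lowercase, split on non-alphanumerics,
    // keep unique tokens in sorted order, then hash the result.
    static String fuzzyPart(String text) {
        String normalized = Arrays.stream(
                text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
            .filter(t -> !t.isEmpty())
            .distinct()
            .sorted()
            .collect(Collectors.joining(" "));
        return md5Hex(normalized);
    }

    // Split the content at the known exact-signature length and
    // emit "EXACT/FUZZY", so fuzzy matches only collide within one owner.
    static String calculate(String content) {
        if (content.length() < EXACT_LEN) {
            throw new IllegalArgumentException("content shorter than exact signature");
        }
        String exact = content.substring(0, EXACT_LEN);
        String text = content.substring(EXACT_LEN);
        return exact + "/" + fuzzyPart(text);
    }

    public static void main(String[] args) {
        String alice = md5Hex("alice");
        String bob = md5Hex("bob");
        // Same owner, near-identical text: the two signatures match.
        System.out.println(calculate(alice + "Hello, World!"));
        System.out.println(calculate(alice + "world hello"));
        // Different owner, same text: the prefix differs.
        System.out.println(calculate(bob + "Hello, World!"));
    }
}
```

Because the owner part is a fixed-length prefix, two documents can only share a mixed signature if they share an owner, which is the per-owner behaviour asked about. Whether the processor really hands over the fields in that order would need to be checked against the Solr version in use.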
Re: Deduplication questions
Posted by Chris Hostetter <ho...@fucit.org>.
: Q1. Is it possible to pass *analyzed* content to the
:
: public abstract class Signature {
No, analysis happens as the documents are being written to the Lucene
index, well after the UpdateProcessors have had a chance to interact with
the values.
: Q2. The calculate() method uses the concatenated fields from
: <str name="fields">name,features,cat</str>
: Is there any mechanism I could use to build "field-dependent signatures"?
At the moment the Signature API is fairly minimal, but it could definitely
be improved by adding more methods (that have sensible defaults in the
base class) that would give the impl more control over the resulting
signature ... we just need people to propose good suggestions with example
use cases.
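
As one illustration of what such a suggestion might look like, here is a minimal sketch of a base class that gains a field-aware method with a default that falls back to the existing concatenate-and-hash behaviour. Everything in it is hypothetical: FieldAwareSignature, the calculate(Map) overload and OwnerPrefixedSignature are invented for this example and are not part of Solr, the init(SolrParams) hook is left out to keep the code free of Solr dependencies, and String.hashCode() is only a placeholder for a real signature algorithm.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical API sketch; none of these types exist in Solr.
final class SignatureApiSketch {

    abstract static class FieldAwareSignature {
        // Existing-style entry point: signature over already-concatenated content.
        public abstract String calculate(String content);

        // Proposed addition with a sensible default: concatenate the values
        // in iteration order and delegate, so old implementations keep working.
        public String calculate(Map<String, String> fields) {
            StringBuilder sb = new StringBuilder();
            for (String value : fields.values()) {
                sb.append(value);
            }
            return calculate(sb.toString());
        }
    }

    // Example implementation that uses the field names to build "OWNER/SIGNATURE".
    static final class OwnerPrefixedSignature extends FieldAwareSignature {
        @Override
        public String calculate(String content) {
            // Placeholder hash; a real implementation would use a proper signature.
            return Integer.toHexString(content.toLowerCase(Locale.ROOT).hashCode());
        }

        @Override
        public String calculate(Map<String, String> fields) {
            String owner = fields.getOrDefault("OWNER", "");
            String text = fields.getOrDefault("TEXT", "");
            return owner + "/" + calculate(text);
        }
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("OWNER", "alice");
        doc.put("TEXT", "Some example text");
        System.out.println(new OwnerPrefixedSignature().calculate(doc));
    }
}
```

With a default like this, implementations that only override calculate(String) would behave as before, while an implementation that cares about field identity can override the map-based method.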
: Is the idea of making two UpdateProcessors and chaining them OK? (It is
: ugly, but it would work.)
I don't know whether what you describe is really an intended use or not,
but it should work.
-Hoss