Posted to solr-user@lucene.apache.org by Jason Brown <Ja...@sjp.co.uk> on 2010/12/14 14:26:21 UTC
De-duplication not working as I expected - duplicates still getting into the index
I have configured de-duplication according to the Wiki.
My signature field is defined thus...
<field name="signature" type="string" stored="true" indexed="true" multiValued="false" />
and my updateRequestProcessor as follows....
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signature</str>
    <str name="fields">content</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
I am using SolrJ to write to the index in the binary (javabin) format, as opposed to XML, so my update handler is defined as below:
<requestHandler name="/update/javabin" class="solr.BinaryUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>
However, I was expecting Solr to allow only 1 instance of a duplicate document into the index, but I get the following results when I query my index...
I have deliberately added my ISA Letter file 4 times and can see it has correctly generated an identical signature for the first 4 entries (d91a5ce933457fd5). The fifth entry is a different document and correctly has a different signature.
I was expecting to only see 1 instance of the duplicate. Am I misinterpreting the way it works? Many Thanks.
<result name="response" numFound="36" start="0">
  <doc>
    <str name="doctitle">ISA Letter</str>
    <str name="signature">d91a5ce933457fd5</str>
  </doc>
  <doc>
    <str name="doctitle">ISA Letter</str>
    <str name="signature">d91a5ce933457fd5</str>
  </doc>
  <doc>
    <str name="doctitle">ISA Letter</str>
    <str name="signature">d91a5ce933457fd5</str>
  </doc>
  <doc>
    <str name="doctitle">ISA Letter</str>
    <str name="signature">d91a5ce933457fd5</str>
  </doc>
  <doc>
    <str name="doctitle">ISA Mailing pack letter</str>
    <str name="signature">fd9d9e1c0de32fb5</str>
  </doc>
If you wish to view the St. James's Place email disclaimer, please use the link below
http://www.sjp.co.uk/portal/internet/SJPemaildisclaimer
Re: De-duplication not working as I expected - duplicates still getting into the index
Posted by Markus Jelsma <ma...@openindex.io>.
Check this setting:
<bool name="overwriteDupes">false</bool>
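With overwriteDupes set to false, the processor still computes and stores the signature, but it does not use that signature to replace documents already in the index, which would match the four identical entries in the results. As a rough sketch, here is the chain from the original post with only that flag changed; everything else is copied as posted, and I have not tested it against this particular setup.

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Changed from false: a new document should now overwrite
         indexed documents that carry the same signature. -->
    <bool name="overwriteDupes">true</bool>
    <str name="signatureField">signature</str>
    <str name="fields">content</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The signature field has to stay indexed for this to work, which it already is in the field definition above. Duplicates that are already in the index would probably need to be deleted or re-indexed before the result set shows a single ISA Letter.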
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350