Posted to solr-dev@lucene.apache.org by "Lance Norskog (JIRA)" <ji...@apache.org> on 2009/02/23 04:18:04 UTC

[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

    [ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675748#action_12675748 ] 

Lance Norskog commented on SOLR-799:
------------------------------------

I came to Solr with no search experience, and it was quite a learning curve. The modular design of the configuration really helped, and we should maintain that modularity. There are two separate designs to consider: the design of the configuration and the design of the implementation. This comment addresses only the design of the configuration files.

The patch as committed moves the specification of one field out of schema.xml and into another file. This breaks the modularity of the configuration. I suggest that the files should look like this:

schema.xml:
<field name="signatureField" type="signatureField" indexed="true" stored="false" signature="solr.TextProfileSignature" fields="product_name, model_t, *_s" />

solrconfig.xml:
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <string name="signatureField">signatureField</string>
    <bool name="enabled">false</bool>
    <bool name="overwriteDupes">true</bool>
  </processor>

  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

That is, the definition of the signature field should go in schema.xml, and each updateRequestProcessorChain section should describe only how that field is used by the chain with that section's declared name. Also, there should be no default field, since every field in the schema should be described in schema.xml.
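
To illustrate the split, here is a rough sketch of how two chains could share the one field declared in schema.xml while differing only in how they use it. This assumes the proposed layout above, not the patch as committed. The second chain name, "dedupe-keep", is made up for the example, and whether the processor would accept exactly these settings under this layout is an assumption.

```xml
<!-- solrconfig.xml (sketch): both chains point at the field defined in schema.xml -->
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <!-- only the field name appears here; its signature class and source fields live in schema.xml -->
    <string name="signatureField">signatureField</string>
    <bool name="overwriteDupes">true</bool>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- hypothetical second chain: same field, but duplicates are not overwritten -->
<updateRequestProcessorChain name="dedupe-keep">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <string name="signatureField">signatureField</string>
    <bool name="overwriteDupes">false</bool>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With this arrangement, changing the signature class or the list of source fields would be a single edit in schema.xml, and neither chain would need to be touched.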



> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Assignee: Yonik Seeley
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: SOLR-799.patch, SOLR-799.patch, SOLR-799.patch, SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking as well as field collapsing. Let's put it into Solr.
> http://wiki.apache.org/solr/Deduplication
