
Posted to solr-user@lucene.apache.org by tinman <th...@gmail.com> on 2011/05/29 01:47:26 UTC

How to ignore whitespace/ case sensitivity with dedupe

Hi all,

I've followed the instructions at
http://wiki.apache.org/solr/Deduplication and got the basic dedupe field
working. However, it doesn't seem to recognize case differences or whitespace
differences, even though I've defined the type of the fields used for dedupe,
as well as the signature field, as follows in schema.xml:

<fieldType autoGeneratePhraseQueries="true" class="solr.TextField"
           name="text_ws_lower" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="name" type="text_ws_lower"/>
<field name="signatureField" type="text_ws_lower"/>

and in solrconfig.xml:

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signatureField</str>
    <str name="fields">name</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

I know a possible solution is to lowercase the "name" field and remove its
whitespace before submitting documents to Solr, but are there any
alternatives, so that given the following data
Name: JOHN SMITH and jOhn      SMITh
the documents end up with the same value in signatureField?
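
The client-side normalization described above could be sketched roughly as follows. This is only an illustration: `normalize_name` and `signature` are hypothetical helpers, MD5 merely stands in for Solr's Lookup3 hash, and the sketch assumes the signature is computed from the field value exactly as submitted, so the clean-up has to happen before the document reaches Solr.

```python
import hashlib


def normalize_name(value: str) -> str:
    """Lowercase the value and collapse every run of whitespace to one space."""
    return " ".join(value.lower().split())


def signature(value: str) -> str:
    """Stand-in hash (MD5, not Solr's Lookup3) of the normalized value."""
    return hashlib.md5(normalize_name(value).encode("utf-8")).hexdigest()


a = "JOHN SMITH"
b = "jOhn      SMITh"
print(normalize_name(a) == normalize_name(b))  # True
print(signature(a) == signature(b))            # True
```

To drop whitespace entirely instead of collapsing it, `"".join(value.lower().split())` would do the same job.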

Thanks heaps
Cheers
tinman

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-ignore-whitespace-case-sensitivity-with-dedupe-tp2997624p2997624.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to ignore whitespace/ case sensitivity with dedupe

Posted by tinman <th...@gmail.com>.

By default, stored=true and indexed=true. In any case, here is some example
output from the Solr search console.

<result name="response" numFound="2" start="0">
  <doc>
    <str name="id">1234</str>
    <str name="name">JOHN   SMITH </str>
    <str name="signatureField">5430fbe9e6374611</str></doc>
  <doc>
    <str name="id">1233</str>
    <str name="name">   john SMITh</str>
    <str name="signatureField">49867a7835ff6741</str></doc>
</result>

As you can see, the two signatures are different. I want
overwriteDupes=false because I want to use field collapsing to remove
duplicates at query time.
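
A query-time collapse on the signature could be requested roughly as in the sketch below. The host, port and path are placeholders, `grouped_query_url` is a hypothetical helper, and the sketch assumes a Solr build that supports result grouping through the `group` and `group.field` parameters. It only helps once equal names actually produce equal signatures.

```python
from urllib.parse import urlencode

# Placeholder base URL; adjust host, port and core path for your installation.
SOLR_SELECT = "http://localhost:8983/solr/select"


def grouped_query_url(query: str, group_field: str = "signatureField") -> str:
    """Build a select URL that groups (collapses) results on one field."""
    params = {
        "q": query,
        "group": "true",             # turn result grouping on
        "group.field": group_field,  # one group per distinct field value
    }
    return SOLR_SELECT + "?" + urlencode(params)


print(grouped_query_url("name:john"))
```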

Thanks
tinman



Re: How to ignore whitespace/ case sensitivity with dedupe

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.


I can't believe this. Those signatures should be different.

Are you sure you see the same signatures in signatureField (it must be
stored=true for you to see the computed signature)? Or did you just see that
those duplicate documents were added, without checking signatureField
yourself? If the latter, that is a feature: because you set
overwriteDupes=false, the duplicate check works only on the uniqueKey field.
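
The effect of overwriteDupes can be illustrated with a toy model. This is not Solr code: `ToyIndex` is a hypothetical in-memory stand-in, the signature value "abc" is a made-up placeholder, and the model assumes that overwriteDupes=true removes earlier documents carrying the same signature, while a document with the same uniqueKey (`id` here) is always replaced.

```python
class ToyIndex:
    """Hypothetical in-memory model of an index keyed by uniqueKey ("id")."""

    def __init__(self, overwrite_dupes: bool):
        self.overwrite_dupes = overwrite_dupes
        self.docs = {}  # id -> document

    def add(self, doc: dict) -> None:
        if self.overwrite_dupes:
            # Drop any other document that carries the same signature.
            self.docs = {
                doc_id: d for doc_id, d in self.docs.items()
                if d["signatureField"] != doc["signatureField"]
            }
        # A document with the same id is always replaced.
        self.docs[doc["id"]] = doc


docs = [
    {"id": "1233", "name": "john smith", "signatureField": "abc"},
    {"id": "1234", "name": "john smith", "signatureField": "abc"},
]

for flag in (False, True):
    index = ToyIndex(overwrite_dupes=flag)
    for doc in docs:
        index.add(doc)
    print(flag, sorted(index.docs))
```

With the flag off, both documents stay in the index even though their signatures match; with it on, only the most recently added one remains.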

koji
-- 
http://www.rondhuit.com/en/