Posted to solr-user@lucene.apache.org by "vrparekh@gmail.com" <vr...@gmail.com> on 2012/12/25 12:12:02 UTC

how to use RemoveDuplicatesTokenFilterFactory?

I want to avoid duplicate values in one multivalued field.

I am using the DataImportHandler to import data. This particular multivalued
field is filled from an XML source. The XML contains duplicate values,
but I want only unique values in this multivalued field.

e.g. xml
<data>
     a1 
     b1 
     a1 
     a1 
</data>

I have added RemoveDuplicatesTokenFilterFactory to the field's type, in the
index analyzer, but it still gives the output below.

<arr name="field">
  <str>a1</str>
  <str>b1</str>
  <str>a1</str>
  <str>a1</str>
</arr>

I am using Solr 3.5.

How can I avoid importing duplicate values into the field?



--
View this message in context: http://lucene.472066.n3.nabble.com/how-to-use-RemoveDuplicatesTokenFilterFactory-tp4029004.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: RemoveDuplicatesTokenFilterFactory to avoid import duplicate values in multivalued field

Posted by Jack Krupansky <ja...@basetechnology.com>.
Your stated problem seems to have nothing to do with the message subject 
line relating to RemoveDuplicatesTokenFilterFactory. Please start a new 
message thread unless you really are concerned with an issue related to 
RemoveDuplicatesTokenFilterFactory.

This kind of "thread hijacking" is inappropriate for this email list (or any 
email list.)

-- Jack Krupansky

-----Original Message----- 
From: tuedel
Sent: Monday, July 01, 2013 8:15 AM
To: solr-user@lucene.apache.org
Subject: Re: RemoveDuplicatesTokenFilterFactory to avoid import duplicate 
values in multivalued field



Re: RemoveDuplicatesTokenFilterFactory to avoid import duplicate values in multivalued field

Posted by tuedel <se...@web.de>.
Hey, I have tried to use the UniqFieldsUpdateProcessorFactory to get
distinct values in multivalued fields. Example below:

<updateRequestProcessorChain name="uniq_fields">
  <processor class="org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory">
    <lst name="fields">
      <str>title</str>
      <str>tag_type</str>
    </lst>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uniq_fields</str>
  </lst>
</requestHandler>

However, the data is indexed one document at a time, and a document may get
an additional tag in a later update. To make sure there are no duplicate
tags, I was hoping the UpdateProcessorFactory would do what I want. To
actually add a tag, I am sending

"tag_type" :{"add":"foo"}, which still adds the tag without checking whether
it is already part of the field. How can I get distinct values on the Solr
side?





Re: how to use RemoveDuplicatesTokenFilterFactory?

Posted by "vrparekh@gmail.com" <vr...@gmail.com>.
Thanks iorixxx,

I tried the configuration below and it works fine. Thank you very much.

<updateRequestProcessorChain>
  <processor class="org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory">
    <lst name="fields">
      <str>field</str>
    </lst>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>







Re: how to use RemoveDuplicatesTokenFilterFactory?

Posted by Ahmet Arslan <io...@yahoo.com>.
> The values are at same logical
> position.

You mean positionIncrementGap is set to 0? Can you see on the analysis page that the duplicates are removed?

By the way, the returned values are the original (stored) values. Analysis (tokenizers, token filters, etc.) only affects the indexed values. An UpdateProcessorFactory can change the stored (returned) values.

Re: how to use RemoveDuplicatesTokenFilterFactory?

Posted by "vrparekh@gmail.com" <vr...@gmail.com>.
The values are at same logical position.




Re: how to use RemoveDuplicatesTokenFilterFactory?

Posted by Ahmet Arslan <io...@yahoo.com>.
> I want to avoid duplicate values in
> one multivalued field.
> 
> i am using dataimport handler to import data,  the
> particular multivalued
> field are being filled from xml source. now that xml has
> duplicate values,
> but i want to have unique valued in this multivalued field.
> 
> e.g. xml
> <data>
>      a1 
>      b1 
>      a1 
>      a1 
> </data>
> 
> i have added RemoveDuplicatesTokenFilterFactory in data type
> of the field,
> in index analyzer.
> still it gives below o/p.
> 
> <arr name="field">
>   <str>a1</str>
>   <str>b1</str>
>   <str>a1</str>
>   <str>a1</str>
> </arr>
> 
> i am using solr 3.5.
> 
> how can i avoid importing duplicate values in the field?
> 

RDTF (RemoveDuplicatesTokenFilterFactory) removes duplicate tokens only when they are at the same position.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.RemoveDuplicatesTokenFilterFactory

An elegant solution would be to subclass FieldValueSubsetUpdateProcessorFactory:
http://lucene.apache.org/solr/api-4_0_0-ALPHA/org/apache/solr/update/processor/FieldValueSubsetUpdateProcessorFactory.html

and create a DistinctFieldValueUpdateProcessorFactory or something like that. MinFieldValueUpdateProcessorFactory can be used as an example.
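
To make the idea concrete, here is a minimal sketch of just the de-duplication step such a processor would apply to the values of one field. It deliberately uses no Solr classes: the class name DistinctValues and the method pickDistinct are hypothetical names chosen for illustration, and wiring the logic into an actual FieldValueSubsetUpdateProcessorFactory subclass is left out, so treat it as an assumption about the shape of the solution rather than working Solr code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of Solr: shows only the de-duplication step
// that a custom "distinct values" update processor would apply to one field.
final class DistinctValues {

    // Returns the values with duplicates dropped, keeping first-seen order.
    static <T> List<T> pickDistinct(Collection<T> values) {
        // LinkedHashSet ignores repeated elements and preserves insertion order.
        return new ArrayList<T>(new LinkedHashSet<T>(values));
    }

    public static void main(String[] args) {
        // Same values as the <data> example in the original question.
        List<String> imported = Arrays.asList("a1", "b1", "a1", "a1");
        System.out.println(pickDistinct(imported)); // prints [a1, b1]
    }
}
```

Because this would run in the update chain before the document is stored, the returned (stored) values would be distinct too, which an analyzer-level filter cannot achieve.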