Posted to solr-user@lucene.apache.org by Rajdeep Sahoo <ra...@gmail.com> on 2020/09/17 18:34:46 UTC

How to remove duplicate tokens from solr

Hi team,
 Is there any way to remove duplicate tokens in Solr? Is there a filter
for this?

Re: How to remove duplicate tokens from solr

Posted by Rajdeep Sahoo <ra...@gmail.com>.
Hi all,
 I found the details below on Stack Overflow, but I am not sure how to
include the JAR. Can anyone help with this?


I've created a new filter class derived from "FilteringTokenFilter". The
task is pretty simple: check whether a token has already been seen before
adding it to the list.

I have created a simple plugin, Eliminate duplicate words:
<https://github.com/volkan/lucene-solr-filter-eliminateduplicate>

To load the plugin, place the JAR files (including EliminateDuplicate-*.jar,
which can be built by running mvn package or taken from
https://github.com/volkan/lucene-solr-filter-eliminateduplicate/tree/master/solr/lib)
in a lib directory in the Solr home directory. The lib directory sits
alongside the solr.xml file.

On Fri, 18 Sep, 2020, 1:04 am Rajdeep Sahoo, <ra...@gmail.com>
wrote:

> But I am not sure why this type of search string causes high CPU
> utilization.

Re: How to remove duplicate tokens from solr

Posted by Rajdeep Sahoo <ra...@gmail.com>.
But I am not sure why this type of search string causes high CPU
utilization.

On Fri, 18 Sep, 2020, 12:49 am Rahul Goswami, <ra...@gmail.com> wrote:

> Is this for a phrase search? If so, the position of each token matters
> too, and it is not clear which token you would want to remove, e.g. in
> "tshirt hat tshirt".
> Also, are you looking to save space and want this at index time? Or do you
> just want to remove duplicates from the search string?
>
> If this is at search time AND is not a phrase search, there are a couple of
> approaches I can think of:
>
> 1) You could handle this in the application layer and pass only the
> deduplicated string to Solr.
> 2) You could write a custom search component and configure it in the
> <first-components> list to process the search string and remove duplicates
> before it hits the default search components. See here (
> https://lucene.apache.org/solr/guide/7_7/requesthandlers-and-searchcomponents-in-solrconfig.html#first-components-and-last-components
> ).
>
> However, if this is for search, I would still evaluate whether writing those
> extra lines of code is worth the investment. I say so because my assumption
> is that, for duplicated tokens in the search string, Lucene is smart enough
> not to fetch the doc IDs again, so you should not need to worry about
> spending computation resources re-evaluating the same tokens. (Someone
> correct me if I am wrong!)
>
> -Rahul

Re: How to remove duplicate tokens from solr

Posted by Rahul Goswami <ra...@gmail.com>.
Is this for a phrase search? If so, the position of each token matters
too, and it is not clear which token you would want to remove, e.g. in
"tshirt hat tshirt".
Also, are you looking to save space and want this at index time? Or do you
just want to remove duplicates from the search string?

If this is at search time AND is not a phrase search, there are a couple of
approaches I can think of:

1) You could handle this in the application layer and pass only the
deduplicated string to Solr.
2) You could write a custom search component and configure it in the
<first-components> list to process the search string and remove duplicates
before it hits the default search components. See here (
https://lucene.apache.org/solr/guide/7_7/requesthandlers-and-searchcomponents-in-solrconfig.html#first-components-and-last-components
).
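
As a rough illustration of option 1, here is a minimal Python sketch of
deduplicating the search string in the application layer before it is sent
to Solr. The function name is made up for this example, and it assumes a
plain keyword query: no quoted phrases, no boolean operators, and a field
that is analyzed case-insensitively. Anything beyond that would need real
query parsing.

```python
def dedupe_query_terms(query: str) -> str:
    """Drop repeated whitespace-separated terms, keeping the first
    occurrence of each term in its original order."""
    seen = set()
    unique_terms = []
    for term in query.split():
        # Assumption: the target field lowercases terms at analysis time,
        # so "Tshirt" and "tshirt" count as the same term.
        key = term.lower()
        if key not in seen:
            seen.add(key)
            unique_terms.append(term)
    return " ".join(unique_terms)


if __name__ == "__main__":
    print(dedupe_query_terms(" tshirt tshirt tshirt tshirt tshirt tshirt"))
    print(dedupe_query_terms("tshirt hat tshirt"))
```

The second call shows why this is only safe outside phrase search: the
repeated "tshirt" is dropped, which changes the token positions.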

However, if this is for search, I would still evaluate whether writing those
extra lines of code is worth the investment. I say so because my assumption
is that, for duplicated tokens in the search string, Lucene is smart enough
not to fetch the doc IDs again, so you should not need to worry about
spending computation resources re-evaluating the same tokens. (Someone
correct me if I am wrong!)

-Rahul

On Thu, Sep 17, 2020 at 2:56 PM Rajdeep Sahoo <ra...@gmail.com>
wrote:

> If someone searches with "tshirt tshirt tshirt tshirt tshirt tshirt",
> we need to remove the duplicates and search with just "tshirt".

Re: How to remove duplicate tokens from solr

Posted by Rajdeep Sahoo <ra...@gmail.com>.
If someone searches with "tshirt tshirt tshirt tshirt tshirt tshirt",
we need to remove the duplicates and search with just "tshirt".


On Fri, 18 Sep, 2020, 12:19 am Alexandre Rafalovitch, <ar...@gmail.com>
wrote:

> This is not quite enough information.
> There is
> https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#remove-duplicates-token-filter
> but it has specific limitations.
>
> What is the problem that you are trying to solve that you feel is due
> to duplicate tokens? Why are they duplicates? Is it about storage or
> relevancy?
>
> Regards,
>    Alex.

Re: How to remove duplicate tokens from solr

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
This is not quite enough information.
There is https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#remove-duplicates-token-filter
but it has specific limitations.
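
For context on those limitations: the reference guide describes this filter
as treating tokens as duplicates only when they have the same text and the
same position. The Python sketch below models that rule; it is not Lucene's
implementation, and the (term, position) tuples and the function name are
assumptions made for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (term text, absolute position in the stream)


def remove_same_position_duplicates(tokens: List[Token]) -> List[Token]:
    """Keep the first token for each (text, position) pair and drop later
    tokens that repeat both the text and the position."""
    seen = set()
    kept = []
    for term, position in tokens:
        if (term, position) not in seen:
            seen.add((term, position))
            kept.append((term, position))
    return kept


if __name__ == "__main__":
    # Repeated words in a query sit at different positions, so all survive.
    print(remove_same_position_duplicates(
        [("tshirt", 1), ("tshirt", 2), ("tshirt", 3)]))
    # Two identical tokens stacked at one position collapse to one.
    print(remove_same_position_duplicates(
        [("watch", 1), ("watch", 1), ("band", 2)]))
```

Under this rule a query like "tshirt tshirt tshirt" is left unchanged,
because each repeated word occupies its own position.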

What is the problem that you are trying to solve that you feel is due
to duplicate tokens? Why are they duplicates? Is it about storage or
relevancy?

Regards,
   Alex.

On Thu, 17 Sep 2020 at 14:35, Rajdeep Sahoo <ra...@gmail.com> wrote:
>
> Hi team,
>  Is there any way to remove duplicate tokens in Solr? Is there a filter
> for this?