Posted to solr-user@lucene.apache.org by Pratik Patel <pr...@semandex.net> on 2020/02/07 18:27:18 UTC

Solr Analyzer : Filter to drop tokens based on some logic which needs access to adjacent tokens

Hello Everyone,

Let's say I have an analyzer which produces the following token stream as its output.

*token stream : [], a, ab, [], c, [], d, de, def .....*

Now let's say I want to add another filter which will drop certain tokens
based on whether the adjacent token on the right side is [] or some string.

For a given token,
     drop it (replace it with an empty string) if there is a non-empty
token on its right, and
     keep it if there is an empty token on its right.

Based on this, the resulting token stream would look like this.

*desired output stream : [], [a]<dropped>, ab, [], c, [], d<dropped>,
de<dropped>, def *


*Is there any filter available in Solr with which this can be achieved?*
*If writing a custom filter is the only possible option, then I want to know
whether it's possible to access adjacent tokens in the custom filter.*

*Any idea about this would be really helpful.*

Thanks,
Pratik
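
For reference, the rule described in the message above can be written down as a small, self-contained Java sketch. It is not a Solr or Lucene filter: the class name RightNeighborRule and the method apply are hypothetical, the empty token [] is modelled as an empty string, and keeping the last token (which has no right neighbour) is an assumption based on the desired output stream shown above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Solr/Lucene class): applies the right-neighbour rule to a list. */
final class RightNeighborRule {

    /**
     * Returns a copy of tokens in which every token that has a non-empty
     * token on its right is replaced by the empty string.
     * Assumption: the last token has no right neighbour and is kept.
     */
    static List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        for (int i = 0; i < tokens.size(); i++) {
            boolean rightIsNonEmpty =
                    i + 1 < tokens.size() && !tokens.get(i + 1).isEmpty();
            out.add(rightIsNonEmpty ? "" : tokens.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // The example stream from the message; "" stands for the empty token [].
        List<String> in = List.of("", "a", "ab", "", "c", "", "d", "de", "def");
        System.out.println(apply(in));
    }
}
```

On the example stream, only ab, c and def survive; every other position becomes an empty string, which matches the desired output stream under the assumptions stated above.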

Re: Solr Analyzer : Filter to drop tokens based on some logic which needs access to adjacent tokens

Posted by Emir Arnautović <em...@sematext.com>.
Hi Pratik,
The shingle filter (ShingleFilter) should be a good example of that.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 10 Feb 2020, at 18:57, Pratik Patel <pr...@semandex.net> wrote:
> 
> Thanks for the reply Emir.
> 
> I will explore the option of creating a custom filter. It's good to
> know that we can consume more than one token from the previous filter and
> emit a different number of tokens. Do you know of any existing filter in
> Solr which does something similar? It would be very helpful to see how more
> than one token can be consumed. I can implement my custom logic once I
> have access to multiple tokens from the previous filter.
> 
> Thanks
> Pratik


Re: Solr Analyzer : Filter to drop tokens based on some logic which needs access to adjacent tokens

Posted by Pratik Patel <pr...@semandex.net>.
Thanks for the reply Emir.

I will explore the option of creating a custom filter. It's good to
know that we can consume more than one token from the previous filter and
emit a different number of tokens. Do you know of any existing filter in
Solr which does something similar? It would be very helpful to see how more
than one token can be consumed. I can implement my custom logic once I
have access to multiple tokens from the previous filter.

Thanks
Pratik

On Mon, Feb 10, 2020 at 2:47 AM Emir Arnautović <
emir.arnautovic@sematext.com> wrote:

> Hi Pratik,
> You might be able to do some of the required things using
> PatternReplaceCharFilter, but as you can see, it operates on the input
> string rather than at the token level. Your best bet is a custom token
> filter. I am not sure how familiar you are with how token filters work, but
> you have access to the tokens from the previous filter and can implement any
> logic you want: for example, you consume three tokens and emit tokens based
> on the adjacent tokens.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/

Re: Solr Analyzer : Filter to drop tokens based on some logic which needs access to adjacent tokens

Posted by Emir Arnautović <em...@sematext.com>.
Hi Pratik,
You might be able to do some of the required things using PatternReplaceCharFilter, but as you can see, it operates on the input string rather than at the token level. Your best bet is a custom token filter. I am not sure how familiar you are with how token filters work, but you have access to the tokens from the previous filter and can implement any logic you want: for example, you consume three tokens and emit tokens based on the adjacent tokens.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
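
The idea of consuming tokens ahead of the one being emitted can be illustrated with a one-token lookahead buffer. The following is only a sketch in plain Java, with no Lucene classes: LookaheadDropIterator is a hypothetical name, the upstream filter is modelled as an Iterator<String>, empty tokens are modelled as empty strings, and the last token is assumed to be kept. In an actual Lucene TokenFilter the same buffering would presumably live in incrementToken(), reading ahead with input.incrementToken() and holding the earlier token with captureState()/restoreState(); that wiring is not shown here.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical sketch (plain Java, no Lucene classes): wraps an upstream
 * iterator and holds back one token so that the right neighbour is known
 * before the current token is emitted.
 */
final class LookaheadDropIterator implements Iterator<String> {
    private final Iterator<String> upstream;
    private String pending;      // token read from upstream but not yet emitted
    private boolean hasPending;

    LookaheadDropIterator(Iterator<String> upstream) {
        this.upstream = upstream;
        advance();               // prime the one-token buffer
    }

    /** Pulls the next token from upstream into the buffer, if there is one. */
    private void advance() {
        hasPending = upstream.hasNext();
        pending = hasPending ? upstream.next() : null;
    }

    @Override
    public boolean hasNext() {
        return hasPending;
    }

    @Override
    public String next() {
        if (!hasPending) {
            throw new NoSuchElementException();
        }
        String current = pending;
        advance();               // pending is now the right neighbour, if any
        boolean rightIsNonEmpty = hasPending && !pending.isEmpty();
        // Replace the token with an empty string when its right neighbour is non-empty.
        return rightIsNonEmpty ? "" : current;
    }

    public static void main(String[] args) {
        Iterator<String> it = new LookaheadDropIterator(
                List.of("", "a", "ab", "", "c", "", "d", "de", "def").iterator());
        while (it.hasNext()) {
            System.out.println("[" + it.next() + "]");
        }
    }
}
```

Holding back a single token is enough here because the rule only looks one position to the right; a rule that needs more context would need a larger buffer.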


