Posted to solr-user@lucene.apache.org by Derek Poh <dp...@globalsources.com> on 2020/09/30 02:27:24 UTC

advice on whether to use stopwords for use case

Hi

I have read on the mailing list that we should try to avoid using stop 
words.

I have a use case and would like to know whether there are alternative 
solutions besides using stop words.

There is a business requirement to return zero results when the search 
consists of cigarette-related words and comes from a particular module 
on our site. It does not apply to all searches from our site.
There is a list of these cigarette-related words. The list contains 
single words, multi-word phrases (Electronic cigar), and multi-word 
phrases with punctuation (e-cigarette case).
I am planning to copy the content into a different set of search fields, 
which will include the stopword filter at the index and query stages, 
for this module to use.

For this use case, other than using stop words to handle it, is there 
any alternative solution?

Derek

----------------------
CONFIDENTIALITY NOTICE 

This e-mail (including any attachments) may contain confidential and/or privileged information. If you are not the intended recipient or have received this e-mail in error, please inform the sender immediately and delete this e-mail (including any attachments) from your computer, and you must not use, disclose to anyone else or copy this e-mail (including any attachments), whether in whole or in part. 

This e-mail and any reply to it may be monitored for security, legal, regulatory compliance and/or other appropriate reasons.

Re: advice on whether to use stopwords for use case

Posted by Derek Poh <dp...@globalsources.com>.
Hi Alex

The business requirement (for now) is not to return any results when the 
search keywords are cigarette related. The business user team will 
provide the list of cigarette-related keywords.

I will digest, explore and research your suggestions. Thank you.


Re: advice on whether to use stopwords for use case

Posted by Walter Underwood <wu...@wunderwood.org>.
I can’t think of an easy way to do this in Solr.

Do a bunch of string searches on the query on the client side. If any of them match, 
make a “no hits” result page.
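
A minimal sketch of that client-side check, in Python. The phrase list, the
function names, and the `run_solr_query` callback are all hypothetical; the
real list would come from the business team. Punctuation is normalized so
that "e-cigarette case" and "E cigarette case" are treated the same, and
matching is on whole words so that unrelated words sharing a prefix are not
blocked.

```python
import re

# Hypothetical blocklist; the real one is supplied by the business team.
BLOCKED_PHRASES = ["cigar", "cigarette", "electronic cigar", "e-cigarette case"]


def normalize(text):
    """Lowercase the text and collapse punctuation/whitespace to single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def is_blocked(query, phrases=BLOCKED_PHRASES):
    """Return True if any blocked phrase occurs in the query as whole words."""
    padded = f" {normalize(query)} "
    return any(f" {normalize(p)} " in padded for p in phrases)


def search(query, run_solr_query):
    """Return no hits for blocked queries; otherwise delegate to Solr.

    run_solr_query is a placeholder for whatever client call the
    application already uses to query Solr.
    """
    if is_blocked(query):
        return []
    return run_solr_query(query)
```

This sketch does no stemming, so a plural such as "cigars" would need its
own entry in the list (or a stemming step) to be caught.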

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: advice on whether to use stopwords for use case

Posted by Derek Poh <dp...@globalsources.com>.
Yes, the requirement (for now) is not to return any results. I think 
they may change the requirements, pending their return from the holidays.

> If so, then check for those words in the query before sending it to Solr.
That is what I think too.

Thinking further, if stopwords are used for this, results will still be 
returned when the search keywords contain more words than just the 
stopwords.
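
To illustrate the point, here is a small Python sketch that mimics what a
stop filter does to a query. It is not Solr code; the stopword set and the
function name are hypothetical, and tokenization is simplified to splitting
on whitespace.

```python
# Hypothetical stopword list standing in for the cigarette-related terms.
STOPWORDS = {"cigar", "cigarette"}


def surviving_terms(query, stopwords=STOPWORDS):
    """Mimic a stop filter: drop stopword tokens and keep everything else."""
    return [term for term in query.lower().split() if term not in stopwords]


if __name__ == "__main__":
    for q in ("cigar", "cuban cigar"):
        print(q, "->", surviving_terms(q))
```

A query made up only of stopwords is left with no terms, but "cuban cigar"
is left with "cuban", which can still match documents. How Solr treats the
remaining terms also depends on the query parser and settings in use.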


Re: advice on whether to use stopwords for use case

Posted by Walter Underwood <wu...@wunderwood.org>.
I’m not clear on the requirements. It sounds like the query “cigar” or “cuban cigar”
should return zero results. Is that right?

If so, then check for those words in the query before sending it to Solr.

But the stopwords approach seems like the requirement is different. Could you give
some examples?

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: advice on whether to use stopwords for use case

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
You may also want to look at something like: https://docs.querqy.org/index.html

ApacheCon had (is having..) a presentation on it that seemed quite
relevant to your needs. The videos should be live in a week or so.

Regards,
   Alex.


Re: advice on whether to use stopwords for use case

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I am not sure why you think stop words are your first choice. Maybe I
misunderstand the question. I read it as: you need to completely
exclude a set of documents that include specific keywords when the
search is called from a specific module.

If I wanted to differentiate the searches from a specific module, I
would give that module a different end-point (request handler)
instead of /select. So, /nocigs or whatever.

Then, in that end-point, you could do all sorts of extra things, such
as setting appends or even invariants parameters, which would include
a filter query to exclude any documents matching specific keywords. I
assume it is OK to return documents that match for other reasons.
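
As a rough sketch of the dedicated end-point idea, the Python below builds a
JSON command of the kind accepted by Solr's Config API
(`add-requesthandler`) that registers /nocigs with an appended filter query.
The field name `text`, the keyword list, and the helper names are
assumptions for illustration only, and the payload should be checked
against the Config API documentation for the Solr version in use.

```python
import json


def exclusion_fq(phrases, field="text"):
    """Build a negative filter query that excludes documents matching any phrase.

    Each phrase is double-quoted so multi-word and hyphenated entries are
    treated as phrases; backslashes and quotes inside a phrase are escaped.
    """
    quoted = " OR ".join(
        '"{}"'.format(p.replace("\\", "\\\\").replace('"', '\\"')) for p in phrases
    )
    return f"-{field}:({quoted})"


def nocigs_handler_payload(phrases, field="text"):
    """Build the command that registers a /nocigs search handler.

    'appends' adds the fq to every request sent to this handler, so
    callers can add their own filters but cannot drop this one.
    """
    return {
        "add-requesthandler": {
            "name": "/nocigs",
            "class": "solr.SearchHandler",
            "appends": {"fq": exclusion_fq(phrases, field)},
        }
    }


if __name__ == "__main__":
    # Hypothetical keyword list; the payload would be POSTed to the
    # collection's /config end-point.
    print(json.dumps(nocigs_handler_payload(["cigar", "electronic cigar"]), indent=2))
```

The same handler could equally be declared in solrconfig.xml. Note that
this excludes matching documents from any result set; it does not by itself
return zero results for a cigarette-related query.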

Ideally, you would mark the cigs documents during indexing with a
binary or enumeration flag, and then during search you just need to
check against that flag. In that case, you could copyField your text
and run it against something like
https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
combined with Shingles for multiwords. Or similar. And just transform
it as index-only so that the result is basically a yes/no flag.
A similar thing could be done with an UpdateRequestProcessor pipeline
if you want to end up with a true boolean flag. The idea is the same:
just have an index-only flag that you lock in for any request from
the specific module.
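
The flag idea can be sketched outside Solr as a pre-indexing step in the
client. This is a swapped-in illustration in Python, not the copyField /
Keep Word filter or UpdateRequestProcessor setup described above; the
phrase list, the field names (`title`, `description`, `is_cig`), and the
function names are all hypothetical.

```python
import re

# Hypothetical phrase list; the real one is supplied by the business team.
CIG_PHRASES = ["cigar", "cigarette", "electronic cigar"]


def _norm(text):
    """Lowercase the text and collapse punctuation/whitespace to single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def flag_document(doc, text_fields=("title", "description"), flag_field="is_cig"):
    """Return a copy of doc with a boolean flag set when any phrase is present."""
    combined = " ".join(str(doc.get(field, "")) for field in text_fields)
    haystack = f" {_norm(combined)} "
    flagged = dict(doc)  # leave the caller's document untouched
    flagged[flag_field] = any(f" {_norm(p)} " in haystack for p in CIG_PHRASES)
    return flagged
```

With such a flag indexed, the module's end-point would only need a fixed
filter query along the lines of `-is_cig:true`.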

Or even with something like ElevationSearchComponent. Same idea.

Hope this helps.

Regards,
   Alex.
