Posted to solr-user@lucene.apache.org by A Adel <aa...@gmail.com> on 2020/05/15 05:47:52 UTC

Dynamic Stopwords

Hi - Is there a way to configure stop words to be dynamic for each document,
based on the language detected for a multilingual text field? Combining all
languages' stop words in one set is a possibility; however, it introduces
false positives for some language combinations, such as German and English.
Thanks, A.

Re: Dynamic Stopwords

Posted by Tim Casey <tc...@gmail.com>.
What I have done for this in the past is to calculate the expected value of
a symbol within a universe, then calculate the difference between that
expected value and the actual value at the time you see the symbol. Take the
difference and use the most surprising symbols, in rank order from most
surprising to least surprising, dropping lower-frequency/unique values.
This was a fairly length-independent way to get to interesting tokens.
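That surprise ranking can be sketched roughly as follows. This is an illustrative Python sketch, not code from the thread; the add-one smoothing and the `min_count` threshold are my own assumptions.

```python
import math
from collections import Counter

def surprising_terms(doc_tokens, background_counts, min_count=2):
    """Rank a document's terms by how far their observed frequency
    exceeds the frequency expected from a background corpus."""
    bg_total = sum(background_counts.values())
    doc_counts = Counter(doc_tokens)
    doc_total = len(doc_tokens)
    scored = []
    for term, count in doc_counts.items():
        if count < min_count:
            continue  # drop lower-frequency/unique values
        observed = count / doc_total
        # add-one smoothing so terms unseen in the background still score
        expected = (background_counts.get(term, 0) + 1) / (bg_total + 1)
        scored.append((math.log(observed / expected), term))
    # most surprising first
    return [term for _, term in sorted(scored, reverse=True)]
```

Common background terms score low (or negative), so they fall to the bottom without any stop word list.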

Most calculations around stop words are very difficult to maintain and
handle.  You can have 7 English stop words easily.  Then you go to a larger
set, say 30ish, then another larger set, say 150.  The problem is that as you
remove stop words, you remove some meaning.  You will see an example of
this when you want to know the difference between 'a noun' and 'the noun'.
  Now that we have covered English and chosen the optimal set of stop words
for a particular set of text, a new language comes along.  Eventually the
stop words become a contributing source of error.  The other reason not to
use stop words is that a corpus is usually a kind of golden egg.  You might be
able to reindex it, but the cost is usually not free.  It is generally
better to have an honest index and allow the post-analysis to change.  That
way you can change it 10 times a day and no one will care.

If you are interested in a word cloud, I suspect people have already done a
reasonable job of this using a Solr index.

tim

On Fri, May 15, 2020 at 1:48 PM A Adel <aa...@gmail.com> wrote:

> Yes, significant terms have been calculated but they have the anomaly or
> relative shift nature rather than the high frequency, as suggested also by
> the blog post. So, it looks that adding a preprocessing step upstream in an
> additional field makes more sense in this case. The text is intrinsically
> not straightforward to parse (short free text) using mainstream NLP though.
>
> A.
>
> On Fri, May 15, 2020, 8:43 PM Walter Underwood <wu...@wunderwood.org>
> wrote:
>
> > Right. I might use NLP to pull out noun phrases and entities. Entities
> are
> > essential noun phrases with proper nouns.
> >
> > Put those in a separate field and build the word cloud from that.
> >
> > wunder
> > Walter Underwood
> > wunder@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On May 15, 2020, at 11:39 AM, Doug Turnbull <
> > dturnbull@opensourceconnections.com> wrote:
> > >
> > > You may want something more like "significant terms" - terms
> > statistically
> > > significant in a document. Possibly not just based on doc freq
> > >
> > > https://saumitra.me/blog/solr-significant-terms/
> > >
> > > On Fri, May 15, 2020 at 2:16 PM A Adel <aa...@gmail.com> wrote:
> > >
> > >> Hi Walter,
> > >>
> > >> Thank you for your explanation, I understand the point and agree with
> > you.
> > >> However, the use case at hand is building a word cloud based on
> faceting
> > >> the multilingual text field (very simple) which in case of not using
> > stop
> > >> words returns many generic terms, articles, etc. If stop words filter
> is
> > >> not used, is there any other/better technique to be used instead to
> > build a
> > >> meaningful word cloud?
> > >>
> > >>
> > >> On Fri, May 15, 2020, 5:20 PM Walter Underwood <wunder@wunderwood.org
> >
> > >> wrote:
> > >>
> > >>> Just don’t use stop words. That will give much better relevance and
> > works
> > >>> for all languages.
> > >>>
> > >>> Stop words are an obsolete hack from the days of search engines
> running
> > >>> on 16 bit CPUs. They save space by throwing away important
> information.
> > >>>
> > >>> The classic example is “to be or not to be”, which is made up
> entirely
> > of
> > >>> stop words. Remove them and it is impossible to search for that
> phrase.
> > >>>
> > >>> wunder
> > >>> Walter Underwood
> > >>> wunder@wunderwood.org
> > >>> http://observer.wunderwood.org/  (my blog)
> > >>>
> > >>>> On May 14, 2020, at 10:47 PM, A Adel <aa...@gmail.com> wrote:
> > >>>>
> > >>>> Hi - Is there a way to configure stop words to be dynamic for each
> > >>> document
> > >>>> based on the language detected of a multilingual text field?
> Combining
> > >>> all
> > >>>> languages stop words in one set is a possibility however it
> introduces
> > >>>> false positives for some language combinations, such as German and
> > >>> English.
> > >>>> Thanks, A.
> > >>>
> > >>>
> > >>
> > >
> > >
> > > --
> > > *Doug Turnbull **| CTO* | OpenSource Connections
> > > <http://opensourceconnections.com>, LLC | 240.476.9983
> > > Author: Relevant Search <http://manning.com/turnbull>; Contributor:
> *AI
> > > Powered Search <http://aipoweredsearch.com>*
> > > This e-mail and all contents, including attachments, is considered to
> be
> > > Company Confidential unless explicitly stated otherwise, regardless
> > > of whether attachments are marked as such.
> >
> >
>

Re: Dynamic Stopwords

Posted by A Adel <aa...@gmail.com>.
Yes, significant terms have been calculated, but they are anomalous or
relative-shift in nature rather than simply high-frequency, as the blog post
also suggests. So it looks like adding a preprocessing step upstream, in an
additional field, makes more sense in this case. The text (short free text)
is intrinsically not straightforward to parse with mainstream NLP, though.

A.
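The upstream preprocessing step discussed here might be sketched as below. This is a hypothetical illustration: a real deployment would typically use Solr's language detection update processor or a library-grade detector, and the tiny stop word lists are placeholders.

```python
# Toy upstream preprocessing: detect each document's language, then
# strip only that language's stop words before writing to an extra field.
# The detector and the word lists are deliberately tiny placeholders.
STOPWORDS = {
    "en": {"the", "a", "an", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "das", "und", "ist", "ein", "zu", "in"},
}

def detect_language(tokens):
    # crude detection: pick the language whose stop word list overlaps most
    overlap = {lang: sum(t in words for t in tokens)
               for lang, words in STOPWORDS.items()}
    return max(overlap, key=overlap.get)

def preprocess(tokens):
    lang = detect_language(tokens)
    return [t for t in tokens if t not in STOPWORDS[lang]]
```

Because only the detected language's list is applied per document, a combined-list false positive such as the English noun "war" (a common German stop word) is avoided.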


Re: Dynamic Stopwords

Posted by Tim Casey <tc...@gmail.com>.
You do not need stop words to do what you need to do.  For one thing, stop
words require segmentation on a phrase-by-phrase basis in some cases.
That is, especially in places like Europe, there is a lot of mixed
language. (Your mileage may vary :).

In order to do what you want, you really need to look at the statistical
value of all of the symbols in the universe of consideration.  Find the
relevant terms, throw out common terms and anything with a frequency below
5.  This is also language-independent, slang-independent, and fairly
medium-independent.  If you need a more refined space, you can build the
symbol space from bigrams.
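A minimal sketch of that corpus-level statistic follows; the thread only specifies dropping common terms and anything below frequency 5, so the `drop_most_common` cutoff is my own illustrative choice.

```python
from collections import Counter

def interesting_symbols(docs, min_freq=5, drop_most_common=10):
    """Keep symbols that are neither too common (stop-word-like) nor
    too rare (frequency below min_freq); also build a bigram space
    for a more refined view."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    common = {t for t, _ in unigrams.most_common(drop_most_common)}
    kept = {t: c for t, c in unigrams.items()
            if c >= min_freq and t not in common}
    return kept, bigrams
```

Nothing here is language-specific: the same counts work for any tokenized text, which is the point Tim is making.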

If I ever write a book the title is going to be "The The".  I hope it has
multi-lingual translations.  Although, at this point, it is a very short
book :/

tim


Re: Dynamic Stopwords

Posted by Walter Underwood <wu...@wunderwood.org>.
Right. I might use NLP to pull out noun phrases and entities. Entities are essentially noun phrases with proper nouns.

Put those in a separate field and build the word cloud from that.
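As a rough stand-in for that NLP step: a real pipeline would use a proper noun-phrase chunker or entity recognizer (spaCy, OpenNLP, and the like), but the capitalization heuristic below illustrates the idea of extracting candidates for a separate field.

```python
import re

def entity_candidates(text):
    """Crude entity extraction: runs of capitalized words that are not
    sentence-initial. The results would be indexed into a separate field
    and the word cloud built from that field."""
    candidates = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = [w.strip(".,;:!?") for w in sentence.split()]
        run = []
        for i, word in enumerate(words):
            if i > 0 and word[:1].isupper():
                run.append(word)
            else:
                if run:
                    candidates.append(" ".join(run))
                run = []
        if run:
            candidates.append(" ".join(run))
    return candidates
```

Faceting on a field populated this way yields a cloud of names and phrases instead of articles and function words.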

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Dynamic Stopwords

Posted by Doug Turnbull <dt...@opensourceconnections.com>.
You may want something more like "significant terms": terms statistically
significant in a document, possibly not just based on doc freq.

https://saumitra.me/blog/solr-significant-terms/
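One common shape for such a score contrasts foreground and background frequency rather than raw counts. The sketch below follows a JLH-style formula (absolute change times relative change) as used for "significant terms" aggregations; it is illustrative, not the exact scoring from the linked post.

```python
def significant_score(term_fg, total_fg, term_bg, total_bg):
    """Score a term by combining its absolute and relative frequency
    change between a foreground set (one document or result set) and
    the background corpus -- a term that is common everywhere scores
    near zero, however high its doc freq."""
    fg = term_fg / total_fg
    bg = term_bg / total_bg
    return (fg - bg) * (fg / bg)
```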



-- 
*Doug Turnbull **| CTO* | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>; Contributor: *AI
Powered Search <http://aipoweredsearch.com>*
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.

Re: Dynamic Stopwords

Posted by A Adel <aa...@gmail.com>.
Hi Walter,

Thank you for your explanation; I understand the point and agree with you.
However, the use case at hand is building a word cloud by faceting the
multilingual text field (very simple), which, without stop words, returns
many generic terms, articles, etc. If a stop word filter is not used, is
there any other/better technique to build a meaningful word cloud?



Re: Dynamic Stopwords

Posted by Walter Underwood <wu...@wunderwood.org>.
Just don’t use stop words. That will give much better relevance and works
for all languages.

Stop words are an obsolete hack from the days of search engines running 
on 16 bit CPUs. They save space by throwing away important information.

The classic example is “to be or not to be”, which is made up entirely of
stop words. Remove them and it is impossible to search for that phrase.
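The point is easy to demonstrate. A toy analysis chain with an English stop filter (word list abbreviated here) leaves nothing of the phrase to index, so no query can ever match it:

```python
ENGLISH_STOPWORDS = {"to", "be", "or", "not", "the", "a", "an", "and", "of"}

def analyze(text):
    """Toy analysis chain: lowercase, whitespace-split, drop stop words."""
    return [t for t in text.lower().split() if t not in ENGLISH_STOPWORDS]
```

`analyze("To be or not to be")` produces an empty token list; the information is gone at index time and cannot be recovered at query time.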

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)
