Posted to solr-user@lucene.apache.org by Vikas Kumar <he...@gmail.com> on 2020/03/20 14:52:00 UTC

Weird issues when using synonyms and stopwords together

I have a field title in my solr schema:

<field name="title" type="text_en" termVectors="true" indexed="true" required="true" stored="true" />

text_en is defined as follows:

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"
               docValues="false" multiValued="false">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true" />
            <filter class="solr.PorterStemFilterFactory" />
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_en.txt"
                    ignoreCase="true" expand="true" />
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.PorterStemFilterFactory" />
        </analyzer>
    </fieldType>

I'm encountering strange behaviour when using multi-word synonyms which
contain stopwords.

If the stopwords appear in the middle, it works fine. For example, if I
have the following in my synonyms file (where i is a stopword):

iphone, apple i phone

And if I query: /select?q=iphone&qf=title&defType=edismax

The parsed query is: +DisjunctionMaxQuery(((((+title:appl +title:phone)
title:iphon))))

Same for query: /select?q=apple i phone&qf=title&defType=edismax

But if a stopword appears at the start or end of a synonym, the behaviour is
unpredictable.

In most cases, the entire synonym is dropped. For example, if I
change my synonyms file to:

iphone, i phone

and do the same query again (with iphone), I get:

+DisjunctionMaxQuery(((title:iphon)))

I was expecting both iphon and phone (since i would be dropped) in my dismax
query.

In some cases, the behaviour is even weirder.

For example, if my synonyms file is:

between two ferns,netflix comedy,zach galifianakis show,netflix 2019 best

and ferns and best are stopwords, then for the following query:

/select?q=netflix comedy&qf=title&defType=edismax

I get this:

+DisjunctionMaxQuery((((+title:between +title:two +title:galifianaki
+title:show) (+title:netflix +title:2019 +title:comedi))))

which is a very weird combination.

I can't understand this behaviour and have not found anything related to it
in the documentation or on the internet. Maybe I'm missing something.
Any help or pointers would be highly appreciated.

Solr version: 8.4.1

Re: Weird issues when using synonyms and stopwords together

Posted by Walter Underwood <wu...@wunderwood.org>.

Do not remove stopwords.

Stopword removal was a hack invented for 16-bit machines and multi-megabyte disks.
That hack is not needed now.

tf.idf addresses the same problem as stopwords with a much better algorithm.
Removing stopwords is an on/off decision for a guess at common words.
tf.idf is a proportional weighting of common words based on the statistics of
your documents.
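
To make "proportional weighting" concrete, here is the textbook tf.idf weight, given as a sketch rather than the exact formula Solr applies (Solr 8 scores with BM25 by default, whose idf component down-weights common terms in the same spirit). The document counts below are made-up numbers for illustration only.

```latex
% Textbook tf-idf weight of term t in document d:
%   tf_{t,d} = occurrences of t in d
%   df_t     = number of documents containing t
%   N        = total number of documents
w_{t,d} = \mathrm{tf}_{t,d} \cdot \ln\frac{N}{\mathrm{df}_t}

% Hypothetical corpus with N = 1{,}000{,}000 documents:
%   "the"    appears in 900{,}000 documents
%   "iphone" appears in   1{,}000 documents
\ln\frac{1{,}000{,}000}{900{,}000} \approx 0.105
\qquad
\ln\frac{1{,}000{,}000}{1{,}000} \approx 6.91
```

With these numbers a match on "the" contributes roughly 65 times less than a match on "iphone", without the word ever being removed from the index.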

Do not remove stopwords.
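
Applied to the schema in the original post, this advice could look like the sketch below: the same text_en field type with the StopFilterFactory removed from both analyzers and everything else unchanged. It is an illustration of the suggestion, not a configuration tested in this thread, and changing the index analyzer means the existing documents must be reindexed.

```xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"
           docValues="false" multiValued="false">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <!-- StopFilterFactory removed: stopwords stay in the index -->
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true" />
        <filter class="solr.PorterStemFilterFactory" />
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_en.txt"
                ignoreCase="true" expand="true" />
        <!-- StopFilterFactory removed: no tokens are deleted from the synonym graph -->
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.PorterStemFilterFactory" />
    </analyzer>
</fieldType>
```

With no stop filter after the synonym filter, a synonym such as "i phone" would be expected to keep both of its tokens in the parsed query; debugQuery=true is the way to confirm that on a real index.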

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)
