
Posted to solr-dev@lucene.apache.org by "Sundling, Paul" <pa...@sonyconnect.com> on 2007/11/02 23:53:28 UTC

default text type and stop words

I'm seeing unexpected results with the default text type when a stop
word is queried on its own versus combined with other clauses.
 
A query on a stop word alone returns no results, as expected.
For example, with 'an' being a stop word:
 
  movieName:an (results: 0, since it's a stop word)
  movieName:another (results: 237)
 
  rating:PG-13  (results: 76095)
 
 
When I combine clauses with AND, ordinary non-stop words like 'another'
behave as expected: the result count is less than or equal to the
smaller of the counts being ANDed.  By the same logic, ANDing in a
stop word clause, which matches nothing on its own, should give 0
results.
 
  rating:PG-13 AND movieName:another (results: 46)
 
  rating:PG-13 AND movieName:an (results: 76095, should be 0)
  
So instead of ANDing the stop word clause, the query seems to ignore
it.  Commenting out the stop word filter in the text type's query
analyzer corrects this behavior, although I'm not sure that's a real
solution.  Even if the actual problem is at the Lucene level, it may be
worth changing the default configuration to work around it.
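
To illustrate what appears to be happening, here is a minimal Python
sketch of a toy index. It is not Solr or Lucene code. It assumes that a
clause whose text analyzes to zero tokens is dropped from the boolean
query rather than treated as matching nothing. The index contents, the
stop word list, and the names analyze and search_and are all
hypothetical.

```python
# Toy model only: hypothetical data and helper names, not the Solr/Lucene API.
STOPWORDS = {"a", "an", "the"}

# field -> term -> set of matching document ids
INDEX = {
    "rating": {"pg-13": {1, 2, 3, 4}},
    "movieName": {"another": {2, 9}},
}


def analyze(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


def search_and(clauses):
    """AND together (field, text) clauses.

    Assumption being illustrated: a clause that analyzes to no tokens
    is skipped entirely instead of contributing an empty result set.
    """
    result = None
    for field, text in clauses:
        tokens = analyze(text)
        if not tokens:
            continue  # stop-word-only clause silently disappears
        for tok in tokens:
            docs = INDEX.get(field, {}).get(tok, set())
            result = set(docs) if result is None else result & docs
    # A query with no surviving clauses matches nothing.
    return result if result is not None else set()
```

In this model, the stop word alone returns no documents, but ANDing it
with the rating clause returns every PG-13 document, which is the same
pattern as the counts above.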
 
Workaround:
 
    <fieldType name="text" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
                synonyms="index_synonyms.txt" ignoreCase="true"
                expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true"
                expand="true"/>
        <!-- comment out to prevent strange behavior
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        -->
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
 
Paul Sundling

Re: default text type and stop words

Posted by Walter Underwood <wu...@netflix.com>.

Stopwords are fairly common in movie titles. There are even titles
made entirely of stopwords. The first one I noticed was "Being There".
I posted more of them here:

http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

wunder
==
Search Guy
Netflix
