Posted to solr-user@lucene.apache.org by Alex Sylka <sy...@gmail.com> on 2015/03/31 23:41:20 UTC

Stopwords magic

My stopwords don't work as expected.
Here is part of my schema:
    <fieldType name="text_general" class="solr.TextField">
        <analyzer type="index">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true"
                    words="stopwords.txt" enablePositionIncrements="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true"
                    words="stopwords.txt" enablePositionIncrements="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>
    <fieldType class="solr.TextField" name="text_auto">
        <analyzer type="index">
            <charFilter class="solr.HTMLStripCharFilterFactory"/>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true"
                    words="stopwords.txt" enablePositionIncrements="false"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
                    outputUnigrams="true" outputUnigramsIfNoShingles="false"/>
        </analyzer>
        <analyzer type="query">
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true"
                    words="stopwords.txt" enablePositionIncrements="false"/>
        </analyzer>
    </fieldType>
    <field name="deal_title_terms" type="text_auto" indexed="true"
           stored="false" required="false" multiValued="true"/>
    <field name="deal_description" type="text_general" indexed="true"
           stored="true" required="false" multiValued="false"/>
My stopwords.txt contains the following words: the, is, a.
My fields contain the following data:

deal_description - This is the my description
deal_title_terms - This is the deal title a terms (will be split into
terms)

When I search deal_description:
Example 1: "deal_description: *his is the m*" - I expect the document with
deal_description "This is the my description" to be returned.
Example 2: "deal_description: *is th*" - I expect nothing to be
found because "is" and "the" are stopwords.

When I search deal_title_terms:
Example 1: "deal_title_terms: *is*" - I expect nothing to be found
because "is" is a stopword.
Example 2: "deal_title_terms: *is the deal*" - I expect "is" and "the"
to be ignored and the term "deal" to be found.
Example 3: "deal_title_terms: *title a terms*" - I expect "a" to be
ignored and the term "title terms" to be found.

Question 1: Why don't stopwords work for the "deal_description" field?
Question 2: Why aren't stopwords removed from my query on the
"deal_title_terms" field? (When I search for *title a terms*, it does not
find the "title terms" term.)
Question 3: Is there any way to show stopwords in search results but prevent
them from being searched on? Example:

data: This is cool search engine
search query : "*is coo*" -> return "This is cool search engine"
search query : "*is*" -> return nothing
search query : "*This coll*" -> return "This is cool search engine"

Question 4: *Where can I find a detailed description (maybe with examples)
of how stopwords work in Solr? Because it looks like magic.*

Re: Stopwords magic

Posted by Jack Krupansky <ja...@gmail.com>.
Use the Solr Admin UI analysis page to see how the text is analyzed at both
index and query time.
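
As a rough illustration of what the analysis page would likely show for
deal_description, here is a minimal Python sketch. It is a toy model, not
Solr code: the function names are made up for illustration, and splitting
on whitespace is only a crude stand-in for StandardTokenizerFactory. The
point it tries to show is that KeywordTokenizerFactory emits the whole
field value as a single token, so a stop filter that matches whole tokens
never sees "is" or "the" on their own.

```python
# Toy model of two analysis chains; this is NOT Solr's implementation.
STOPWORDS = {"the", "is", "a"}  # contents of stopwords.txt per the original post


def keyword_tokenize(text):
    """Mimic KeywordTokenizerFactory: the whole input becomes one token."""
    return [text]


def word_tokenize(text):
    """Crude stand-in for StandardTokenizerFactory: split on whitespace."""
    return text.split()


def stop_filter(tokens, ignore_case=True):
    """Mimic StopFilterFactory: drop tokens that exactly match a stopword."""
    def is_stop(token):
        key = token.lower() if ignore_case else token
        return key in STOPWORDS
    return [t for t in tokens if not is_stop(t)]


def lowercase_filter(tokens):
    """Mimic LowerCaseFilterFactory."""
    return [t.lower() for t in tokens]


def analyze(text, tokenizer):
    """Apply tokenizer, then stop filter, then lowercase, as in text_general."""
    return lowercase_filter(stop_filter(tokenizer(text)))


if __name__ == "__main__":
    description = "This is the my description"
    # One big token: no stopword matches, so nothing is removed.
    print(analyze(description, keyword_tokenize))
    # Per-word tokens: "is" and "the" are dropped.
    print(analyze(description, word_tokenize))
```

In this model the keyword chain yields the single token "this is the my
description", while the word-based chain yields "this", "my", and
"description". If the real analysis page shows the same pattern, switching
text_general to a word-based tokenizer may be what lets the stop filter
take effect.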

My e-book does have more narrative and examples for stop word processing:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

-- Jack Krupansky
