
Posted to solr-user@lucene.apache.org by Bruno Aranda <br...@gmail.com> on 2009/03/12 13:28:29 UTC

How to remove stemming from the analyzer - Finding "blah" when searching for "blah*"

Hi,

I am trying to disable stemming from the analyzer, but I am not sure how to
do it.

For instance, I have a field that contains "blah", but when I search for
"blah*" it cannot find it, whereas if I search for "bla*" it does. I was
using the text type field, from the example schema.xml. How should I modify
it so that stemming is not done and I can find "blah" when I search for
"blah*"?

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
      add enablePositionIncrements=true in both the index and query
      analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I have tried using the "textTight" type to no avail. Most of the fields in
my documents have this structure:

DOC1 field> gene name:brca2
DOC2 field> gene name:brca23

If I search for "brca2*", I would like to find both documents. My field
values normally contain colons (':'), which should be treated as stop words.

Thank you in advance,

Bruno

Re: How to remove stemming from the analyzer - Finding "blah" when searching for "blah*"

Posted by Bruno Aranda <br...@gmail.com>.

Thank you! Next time I will remember not to change the words to make the
example simpler...

blah is not the same as Nefh :-)

Thanks,

Bruno

Re: How to remove stemming from the analyzer - Finding "blah" when searching for "blah*"

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Mar 12, 2009, at 10:47 AM, Bruno Aranda wrote:
> Doing this query:
>
> http://localhost:18080/solr/core_pub/select/?q=mitab:Nefh
>
> Finds 1 result. The term "Nefh" is found in the field "mitab".
>
> Doing:
>
> http://localhost:18080/solr/core_pub/select/?q=mitab:Nefh*
>
> Finds nothing.
>
> I have realised that Ne* or Nef* do not return results either, using the
> textIntact type...

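Ah... the problem is that wildcarded query terms do not get analyzed, nor
do they get lowercased (there is an open Solr issue to at least make
lowercasing configurable; Lucene already supports it). Your index-time
LowerCaseFilterFactory stores the term as "nefh", so the unanalyzed
prefix "Nefh" matches nothing.

Try lowercasing the wildcard term in your query client (for example,
q=mitab:nefh*); that should do the trick.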

	Erik

Re: How to remove stemming from the analyzer - Finding "blah" when searching for "blah*"

Posted by Bruno Aranda <br...@gmail.com>.

Thanks again. This is the default request handler:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>

Doing this query:

http://localhost:18080/solr/core_pub/select/?q=mitab:Nefh

Finds 1 result. The term "Nefh" is found in the field "mitab".

Doing:

http://localhost:18080/solr/core_pub/select/?q=mitab:Nefh*

Finds nothing.

I have realised that Ne* or Nef* do not return results either, using the
textIntact type...

Thank you,

Bruno

Re: How to remove stemming from the analyzer - Finding "blah" when searching for "blah*"

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

What is the full query you're issuing to Solr, and what is the corresponding
request handler configuration?

Chances are you're using the dismax query parser, which does not support
wildcards. Other things to check: be sure you've tied the field to your new
textIntact type, and that you're searching that field (see
defaultSearchField in schema.xml).

Try something like /solr/select?q=field_name:blah*

	Erik

Re: How to remove stemming from the analyzer - Finding "blah" when searching for "blah*"

Posted by Bruno Aranda <br...@gmail.com>.

Thanks for your answer, I am trying now with this custom text field:

<fieldType name="textIntact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="0"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            expand="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

And still it does not find "blah" when using the wildcard and searching for
"blah*". Am I missing something?

Thanks,

Bruno

Re: How to remove stemming from the analyzer - Finding "blah" when searching for "blah*"

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

Remove the EnglishPorterFilterFactory from your "text" analyzer
configuration (both index and query sides), then reindex all documents.
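
A minimal sketch of what that change might look like, assuming the "text"
type from the example schema.xml quoted in the original post: the same
field type with the solr.EnglishPorterFilterFactory line dropped from both
analyzers and every other filter left as it was. The commented-out
index-time synonym filter is omitted here for brevity.

```xml
<!-- Sketch only: the example "text" type with the Porter stemmer removed.
     All attribute values are copied from the original post. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- solr.EnglishPorterFilterFactory was here (index side) -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- solr.EnglishPorterFilterFactory was here (query side) -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Documents indexed before the change still carry stemmed terms, which is
why the reindex is needed before the new analyzer takes effect.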

	Erik
