Posted to solr-user@lucene.apache.org by Bruno Aranda <br...@gmail.com> on 2009/03/12 13:28:29 UTC
How to remove stemming from the analyzer - Finding "blah" when
searching for "blah*"
Hi,
I am trying to disable stemming in the analyzer, but I am not sure how to
do it.
For instance, I have a field that contains "blah", but when I search for
"blah*" it cannot find it, whereas if I search for "bla*" it does. I was
using the text type field, from the example schema.xml. How should I modify
it so that stemming is not done and I can find "blah" when I search for
"blah*"?
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
    />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
    />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
I have tried using the "textTight" type to no avail. Most of the fields in
my documents have this structure:
DOC1 field> gene name:brca2
DOC2 field> gene name:brca23
If I search for "brca2*", I would like to find both documents. My field
values normally contain colons ':' that should be used as stop words.
Thank you in advance,
Bruno
Re: How to remove stemming from the analyzer - Finding "blah" when
searching for "blah*"
Posted by Bruno Aranda <br...@gmail.com>.
Thank you! Next time I will remember not to change the words to make the
example simpler...
blah is not the same as Nefh :-)
Thanks,
Bruno
Re: How to remove stemming from the analyzer - Finding "blah" when searching for "blah*"
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Mar 12, 2009, at 10:47 AM, Bruno Aranda wrote:
> Doing this query:
>
> http://localhost:18080/solr/core_pub/select/?q=mitab:Nefh
>
> Finds 1 result. The term "Nefh" is found in the field "mitab".
>
> Doing:
>
> http://localhost:18080/solr/core_pub/select/?q=mitab:Nefh*
>
> Finds nothing.
>
> I have realised that Ne* or Nef* do not return results either, using the
> textIntact type...
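Ah... the problem is that wildcarded query terms do not get analyzed,
so they do not get lowercased either (making lowercasing at least
configurable is an open issue with Solr; Lucene supports it). Your
textIntact analyzer lowercases terms at index time, so the index most
likely holds "nefh", which the unanalyzed prefix "Nefh" cannot match.
Try lowercasing in your query client, that should do the trick.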
Erik
Re: How to remove stemming from the analyzer - Finding "blah" when
searching for "blah*"
Posted by Bruno Aranda <br...@gmail.com>.
Thanks again. This is the default request handler:
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>
Doing this query:
http://localhost:18080/solr/core_pub/select/?q=mitab:Nefh
Finds 1 result. The term "Nefh" is found in the field "mitab".
Doing:
http://localhost:18080/solr/core_pub/select/?q=mitab:Nefh*
Finds nothing.
I have realised that Ne* or Nef* do not return results either, using the
textIntact type...
Thank you,
Bruno
Re: How to remove stemming from the analyzer - Finding "blah" when searching for "blah*"
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
What is the full query you're issuing to Solr and the corresponding
request handler configuration?
Chances are you're using the dismax query parser, which does not
support wildcards. Other things to check: be sure you've tied the
field to your new textIntact type, and that you're searching that
field (see defaultSearchField in schema.xml).
Try something like /solr/select?q=field_name:blah*
Erik
Re: How to remove stemming from the analyzer - Finding "blah" when
searching for "blah*"
Posted by Bruno Aranda <br...@gmail.com>.
Thanks for your answer, I am trying now with this custom text field:
<fieldType name="textIntact" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="0"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            expand="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
And still it does not find "blah" when using the wildcard and searching for
"blah*". Am I missing something?
Thanks,
Bruno
Re: How to remove stemming from the analyzer - Finding "blah" when searching for "blah*"
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Remove the EnglishPorterFilterFactory from your "text" analyzer
configuration (both index and query sides), then reindex all documents.
Erik
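For illustration, the change described above might look like the sketch
below. It is the "text" type from the original post with only the
EnglishPorterFilterFactory entries dropped from both analyzers; keeping
every other filter exactly as posted is an assumption, not something the
thread confirms is required. Documents indexed before the change still
hold stemmed terms, hence the need to reindex.

```xml
<!-- Sketch only: the posted "text" type minus the Porter stemmer. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- EnglishPorterFilterFactory removed: terms are indexed unstemmed -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- EnglishPorterFilterFactory removed here as well, so that the
         query side produces the same unstemmed terms as the index side -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```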