Posted to solr-user@lucene.apache.org by Demian Katz <de...@villanova.edu> on 2011/04/20 20:01:43 UTC
Bug in solr.KeywordMarkerFilterFactory?
I've just started experimenting with the solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some strange behavior. It seems that every word subsequent to a protected word is also treated as being protected.
For testing purposes, I have put the word "spelling" in my protwords.txt. If I do a test for "spelling bees" in the analyze tool, the stemmer produces "spelling bees" - nothing is stemmed. But if I do a test for "bees spelling", I get "bee spelling", the expected result with "bees" stemmed but "spelling" left unstemmed. I have tried extended examples - in every case I tried, all of the words prior to "spelling" get stemmed, but none of the words after "spelling" get stemmed. When turning on the verbose mode of the analyze tool, I can see that the settings of the "keyword" attribute introduced by solr.KeywordMarkerFilterFactory correspond with the stemming behavior... so I think the solr.KeywordMarkerFilterFactory component is to blame, and not anything later in the analysis chain.
Any idea what might be going wrong? Is this a known issue?
Here is my field type definition, in case it makes a difference:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
thanks,
Demian
RE: Bug in solr.KeywordMarkerFilterFactory?
Posted by Demian Katz <de...@villanova.edu>.
That's good news -- thanks for the help (not to mention the reassurance that Solr itself is actually working right)! Hopefully 3.1.1 won't be too far off, though; when the analysis tool lies, life can get very confusing! :-)
- Demian
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Wednesday, April 20, 2011 2:54 PM
> To: solr-user@lucene.apache.org; yonik@lucidimagination.com
> Subject: Re: Bug in solr.KeywordMarkerFilterFactory?
>
> No, this is only a bug in analysis.jsp.
>
> you can see this by comparing analysis.jsp's "dontstems bees" to using
> the query debug interface:
> <lst name="debug">
> <str name="rawquerystring">"dontstems bees"</str>
> <str name="querystring">"dontstems bees"</str>
> <str name="parsedquery">PhraseQuery(text:"dontstems bee")</str>
> <str name="parsedquery_toString">text:"dontstems bee"</str>
>
> On Wed, Apr 20, 2011 at 2:43 PM, Yonik Seeley
> <yo...@lucidimagination.com> wrote:
> > On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz <de...@villanova.edu> wrote:
> >> I've just started experimenting with the solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some strange behavior. It seems that every word subsequent to a protected word is also treated as being protected.
> >
> > You're right! This was broken by LUCENE-2901 back in Jan.
> > I've opened this issue: https://issues.apache.org/jira/browse/LUCENE-3039
> >
> > The easiest short-term workaround for you would probably be to create
> > a custom filter that looks like KeywordMarkerFilter before the
> > LUCENE-2901 change.
> >
> > -Yonik
> > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> > 25-26, San Francisco
> >
Re: Bug in solr.KeywordMarkerFilterFactory?
Posted by Robert Muir <rc...@gmail.com>.
No, this is only a bug in analysis.jsp.
you can see this by comparing analysis.jsp's "dontstems bees" to using
the query debug interface:
<lst name="debug">
<str name="rawquerystring">"dontstems bees"</str>
<str name="querystring">"dontstems bees"</str>
<str name="parsedquery">PhraseQuery(text:"dontstems bee")</str>
<str name="parsedquery_toString">text:"dontstems bee"</str>
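
For anyone wondering how the analysis page and the real query path could disagree, here is a toy Python model of one plausible mechanism. It is not Lucene or Solr code: every name in it (mark_keywords, naive_stem, analyze, clear_between_tokens) is made up for illustration. It assumes the marker only ever switches the keyword flag on, and that the analysis page replays tokens without clearing per-token state between them. The thread does not confirm that this is the actual cause.

```python
# Toy model only: hypothetical stand-ins, not Lucene or Solr APIs.
PROTECTED = {"dontstems"}


def mark_keywords(term, attrs):
    # Set-only marking: switches the flag on for protected terms, never off.
    if term in PROTECTED:
        attrs["keyword"] = True


def naive_stem(term, attrs):
    # Stand-in for a real stemmer: strip one trailing "s" unless protected.
    if attrs["keyword"] or not term.endswith("s"):
        return term
    return term[:-1]


def analyze(terms, clear_between_tokens):
    attrs = {"keyword": False}  # shared attribute state, reused for every token
    out = []
    for term in terms:
        if clear_between_tokens:
            attrs["keyword"] = False  # assumed behavior of the real query path
        mark_keywords(term, attrs)
        out.append(naive_stem(term, attrs))
    return out


print(analyze(["dontstems", "bees"], clear_between_tokens=True))   # ['dontstems', 'bee']
print(analyze(["dontstems", "bees"], clear_between_tokens=False))  # ['dontstems', 'bees']
print(analyze(["bees", "dontstems"], clear_between_tokens=False))  # ['bee', 'dontstems']
```

With clearing, the output matches the parsedquery above. Without it, the flag set by "dontstems" sticks to every later token, which is the pattern reported for the analysis page.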
On Wed, Apr 20, 2011 at 2:43 PM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz <de...@villanova.edu> wrote:
>> I've just started experimenting with the solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some strange behavior. It seems that every word subsequent to a protected word is also treated as being protected.
>
> You're right! This was broken by LUCENE-2901 back in Jan.
> I've opened this issue: https://issues.apache.org/jira/browse/LUCENE-3039
>
> The easiest short-term workaround for you would probably be to create
> a custom filter that looks like KeywordMarkerFilter before the
> LUCENE-2901 change.
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>
Re: Bug in solr.KeywordMarkerFilterFactory?
Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz <de...@villanova.edu> wrote:
> I've just started experimenting with the solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some strange behavior. It seems that every word subsequent to a protected word is also treated as being protected.
You're right! This was broken by LUCENE-2901 back in Jan.
I've opened this issue: https://issues.apache.org/jira/browse/LUCENE-3039
The easiest short-term workaround for you would probably be to create
a custom filter that looks like KeywordMarkerFilter before the
LUCENE-2901 change.
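
To make the idea concrete, here is a rough Python sketch of what such a workaround might amount to, on the assumption that the older filter assigned the keyword flag explicitly for every token. This is a toy model, not the real KeywordMarkerFilter source, and all names in it (mark_keywords_explicit, naive_stem, analyze) are hypothetical.

```python
# Toy model only: hypothetical names, not Lucene or Solr APIs.
PROTECTED = {"spelling"}


def mark_keywords_explicit(term, attrs):
    # Assign the flag for every token, so stale state from an earlier
    # token cannot leak through even if nothing clears it in between.
    attrs["keyword"] = term in PROTECTED


def naive_stem(term, attrs):
    # Stand-in for a real stemmer: strip one known suffix unless protected.
    if attrs["keyword"]:
        return term
    for suffix in ("ing", "s"):
        if term.endswith(suffix):
            return term[: -len(suffix)]
    return term


def analyze(terms):
    attrs = {"keyword": False}  # shared state, deliberately never cleared
    out = []
    for term in terms:
        mark_keywords_explicit(term, attrs)
        out.append(naive_stem(term, attrs))
    return out


print(analyze(["spelling", "bees"]))  # ['spelling', 'bee']
print(analyze(["bees", "spelling"]))  # ['bee', 'spelling']
```

The trade-off in this toy version is that an explicit assignment also overwrites a flag set by any earlier stage, so it only fits a chain with a single marker.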
-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco