
Posted to solr-user@lucene.apache.org by Demian Katz <de...@villanova.edu> on 2011/04/20 20:01:43 UTC

Bug in solr.KeywordMarkerFilterFactory?

I've just started experimenting with the solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some strange behavior.  It seems that every word subsequent to a protected word is also treated as being protected.

For testing purposes, I have put the word "spelling" in my protwords.txt.  If I do a test for "spelling bees" in the analysis tool, the stemmer produces "spelling bees" - nothing is stemmed.  But if I do a test for "bees spelling", I get "bee spelling", the expected result with "bees" stemmed but "spelling" left unstemmed.  I have tried extended examples - in every case I tried, all of the words prior to "spelling" get stemmed, but none of the words after "spelling" get stemmed.  When turning on the verbose mode of the analysis tool, I can see that the settings of the "keyword" attribute introduced by solr.KeywordMarkerFilterFactory correspond with the stemming behavior... so I think the solr.KeywordMarkerFilterFactory component is to blame, and not anything later in the analysis chain.
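
For reference, the behavior expected here can be written down as a small toy model. This is only a sketch in plain Java: it uses no Lucene or Solr classes, the class name ProtectedStemDemo is made up, and toyStem is a hypothetical stand-in for the Snowball stemmer that only knows how to drop "ing" or a final "s". The point is that each token is checked against the protected list on its own, so protecting "spelling" should never affect "bees".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: plain Java, no Lucene or Solr classes involved.
final class ProtectedStemDemo {

    // Hypothetical stand-in for the Snowball stemmer:
    // drops a trailing "ing" from longer words, otherwise a trailing "s".
    static String toyStem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Each token is looked up in the protected set independently,
    // so protecting one word has no effect on its neighbors.
    static List<String> analyze(String text, Set<String> protectedWords) {
        List<String> out = new ArrayList<String>();
        for (String term : text.split("\\s+")) {
            out.add(protectedWords.contains(term) ? term : toyStem(term));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> protwords = new HashSet<String>(Arrays.asList("spelling"));
        System.out.println(analyze("spelling bees", protwords));             // [spelling, bee]
        System.out.println(analyze("bees spelling", protwords));             // [bee, spelling]
        System.out.println(analyze("spelling bees", new HashSet<String>())); // [spell, bee]
    }
}
```

The first line of output is what the analysis tool would be expected to show for "spelling bees"; the report above is that it shows "spelling bees" instead.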

Any idea what might be going wrong?  Is this a known issue?

Here is my field type definition, in case it makes a difference:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

thanks,
Demian

RE: Bug in solr.KeywordMarkerFilterFactory?

Posted by Demian Katz <de...@villanova.edu>.

That's good news -- thanks for the help (not to mention the reassurance that Solr itself is actually working right)!  Hopefully 3.1.1 won't be too far off, though; when the analysis tool lies, life can get very confusing! :-)

- Demian


Re: Bug in solr.KeywordMarkerFilterFactory?

Posted by Robert Muir <rc...@gmail.com>.

No, this is only a bug in analysis.jsp.

You can see this by comparing analysis.jsp's output for "dontstems bees" with
what the query debug interface reports:
<lst name="debug">
  <str name="rawquerystring">"dontstems bees"</str>
  <str name="querystring">"dontstems bees"</str>
  <str name="parsedquery">PhraseQuery(text:"dontstems bee")</str>
  <str name="parsedquery_toString">text:"dontstems bee"</str>
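
Debug output like the above is what Solr returns when the debugQuery parameter is added to a search request. A minimal sketch of building such a request URL follows. The host, port and /solr/select path are assumptions about a default local install, and the class name DebugQueryUrl is made up for illustration; adjust them to your own setup.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Builds a Solr select URL with debugQuery switched on.
final class DebugQueryUrl {

    // Assumption: Solr running locally on the default example port and path.
    static final String BASE = "http://localhost:8983/solr/select";

    static String build(String query) {
        try {
            // URL-encode the query so quotes and spaces survive the request.
            return BASE + "?q=" + URLEncoder.encode(query, "UTF-8") + "&debugQuery=on";
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Prints: http://localhost:8983/solr/select?q=%22dontstems+bees%22&debugQuery=on
        System.out.println(build("\"dontstems bees\""));
    }
}
```

Opening the printed URL in a browser should show the parsed query in the "debug" section of the response, which reflects what the analyzer actually did with each term.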

On Wed, Apr 20, 2011 at 2:43 PM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz <de...@villanova.edu> wrote:
>> I've just started experimenting with the solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some strange behavior.  It seems that every word subsequent to a protected word is also treated as being protected.
>
> You're right!  This was broken by LUCENE-2901 back in Jan.
> I've opened this issue:  https://issues.apache.org/jira/browse/LUCENE-3039
>
> The easiest short-term workaround for you would probably be to create
> a custom filter that looks like KeywordMarkerFilter before the
> LUCENE-2901 change.
>
> -Yonik
>

Re: Bug in solr.KeywordMarkerFilterFactory?

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz <de...@villanova.edu> wrote:
> I've just started experimenting with the solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some strange behavior.  It seems that every word subsequent to a protected word is also treated as being protected.

You're right!  This was broken by LUCENE-2901 back in Jan.
I've opened this issue:  https://issues.apache.org/jira/browse/LUCENE-3039

The easiest short-term workaround for you would probably be to create
a custom filter that looks like KeywordMarkerFilter before the
LUCENE-2901 change.
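
The thread does not spell out the mechanism, so the following is an assumption rather than a description of the actual Lucene code: the symptom would fit a marker that only ever switches the keyword flag on, combined with a consumer that reuses the flag from token to token without clearing it. The toy model below (plain Java, no Lucene classes, all names hypothetical) compares that with a marker that assigns the flag for every token, which is the kind of custom filter the workaround seems to suggest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: simulates one keyword flag shared by all tokens in a stream.
final class StickyKeywordDemo {

    // Returns, per token, the keyword flag a downstream stemmer would see.
    // assignEveryToken:   true  = flag is overwritten (true or false) for each token
    //                     false = flag is only ever switched on, never off
    // clearBetweenTokens: whether the consumer resets the flag before each token
    static List<Boolean> keywordFlags(List<String> tokens, Set<String> protectedWords,
                                      boolean assignEveryToken, boolean clearBetweenTokens) {
        List<Boolean> flags = new ArrayList<Boolean>();
        boolean keyword = false; // shared state, reused for every token
        for (String term : tokens) {
            if (clearBetweenTokens) {
                keyword = false;
            }
            boolean isProtected = protectedWords.contains(term);
            if (assignEveryToken) {
                keyword = isProtected;
            } else if (isProtected) {
                keyword = true;
            }
            flags.add(keyword);
        }
        return flags;
    }

    public static void main(String[] args) {
        Set<String> protwords = new HashSet<String>(Arrays.asList("spelling"));
        List<String> tokens = Arrays.asList("spelling", "bees");

        // Set-only marker, state never cleared: the flag sticks to "bees".
        System.out.println(keywordFlags(tokens, protwords, false, false)); // [true, true]
        // Set-only marker, state cleared per token: behaves as expected.
        System.out.println(keywordFlags(tokens, protwords, false, true));  // [true, false]
        // Marker that assigns every token: fine even without clearing.
        System.out.println(keywordFlags(tokens, protwords, true, false));  // [true, false]
    }
}
```

In this model the first case reproduces the report (everything after a protected word looks protected, nothing before it does), while either clearing the state per token or assigning the flag on every token makes it go away.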

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco