Posted to solr-user@lucene.apache.org by Chris Book <ch...@gmail.com> on 2012/11/21 04:55:34 UTC

SynonymFilterFactory breaking WordDelimiterFilterFactory output

Hello, I've recently upgraded from Solr 1.4.1 to 3.6.1 and am running into
a problem with a specific query.  When I search for "8mile" or "8-mile"
without the quotes, and I use just the WordDelimiterFilterFactory as
configured below, I get this query, which is as expected: album:"(8mile 8)
mile"

But when I also add the SynonymFilterFactory config listed below, I get
this query instead: album:"8mile eight mile".  In my test the only entry
in synonyms.txt is 8=>eight.  The issue with the 2nd query is that the
brackets are removed, so it now seems to require all 3 terms as a phrase.

So why does WordDelimiterFilterFactory generate the query I want, with both
original and split phrases, but when the number 8 is replaced with eight,
that data is lost and I end up with a phrase that will cause no results to
be found?

This is part of a test case that I believe used to work on 1.4.1, but I
still have to confirm.


    <fieldType name="text_title" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="1"
                catenateNumbers="1"
                catenateAll="0"
                preserveOriginal="1"
                splitOnCaseChange="1"
                protected="protwords.txt"
                />
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt"
                ignoreCase="true"
                expand="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                preserveOriginal="1"
                splitOnCaseChange="1"
                protected="protwords.txt"
                />
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt"
                ignoreCase="true"
                expand="true"/>
      </analyzer>
    </fieldType>

Note that I have updated my schema version to 1.5 and my luceneMatchVersion
to LUCENE_36.

Thanks,
Chris

Re: SynonymFilterFactory breaking WordDelimiterFilterFactory output

Posted by Erick Erickson <er...@gmail.com>.

Best advice here is to look hard at admin/analysis and see.

But a couple of notes:
1> It's usually unnecessary to include the exact same synonyms in both
the query-time and index-time chains. Index-time is preferred.

2> Putting LowerCaseFilter in front of WordDelimiterFilter is going to
break WDFF _if_ you intend camel-case to produce multiple tokens.

3> Brackets? Did you mean parentheses? If so, I suspect your issue is in
your request handler, not your analysis chain. Perhaps something with
autoGeneratePhraseQueries?
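
To illustrate note 2, here is a toy model in Python of why filter order matters for case-change splitting. It is only a sketch: split_on_case_change is a hypothetical helper that mimics the lower-to-upper split alone, not the real WordDelimiterFilter (which also splits on delimiters and letter/digit boundaries and can preserve the original token).

```python
import re


def split_on_case_change(token):
    """Hypothetical stand-in for splitOnCaseChange: break at lower-to-upper transitions."""
    # Insert a space at each lowercase->uppercase boundary, then split on it.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()


def lowercase_then_split(token):
    """Order used in the posted schema: lowercase first, then split."""
    return split_on_case_change(token.lower())


def split_then_lowercase(token):
    """Alternative order: split first, then lowercase each part."""
    return [part.lower() for part in split_on_case_change(token)]


if __name__ == "__main__":
    print(lowercase_then_split("PowerShot"))   # case information is gone before the split
    print(split_then_lowercase("PowerShot"))   # split happens while case is still visible
```

Once the token has been lowercased there is no case boundary left to split on, so the first ordering yields a single token. If camel-case splitting is not needed for this field, the posted order is harmless.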

Providing the &debugQuery=true output would help.

Best
Erick
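
For anyone wanting to capture the debugQuery output mentioned above, a minimal Python sketch for building the request URL follows. The base URL and the helper name debug_query_url are assumptions for illustration (a default single-core install on localhost); only the q and debugQuery parameters come from Solr itself.

```python
from urllib.parse import urlencode


def debug_query_url(base_url, query):
    """Build a select URL with debugQuery=true appended (hypothetical helper)."""
    params = {"q": query, "debugQuery": "true"}
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Assumed host, port, and handler path; adjust to the actual install.
    print(debug_query_url("http://localhost:8983/solr/select", "album:8-mile"))
```

The parsed query reported in the debug section of the response is the part to compare between the two analyzer configurations.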



Re: SynonymFilterFactory breaking WordDelimiterFilterFactory output

Posted by Yonik Seeley <yo...@lucidworks.com>.

Sounds like perhaps the SynonymFilter is losing the positionIncrement
of 0 (which makes the first two tokens overlap)?
You could perhaps verify with the analysis debugging on the admin page.

-Yonik
http://lucidworks.com
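
The positionIncrement hypothesis above can be sketched with a simplified Python model of how a phrase is assembled from analyzed tokens. This is not Lucene's query parser: build_phrase is a hypothetical helper, and the token streams are assumptions inferred from the two query strings reported in the original post. Tokens with an increment of 0 are stacked on the previous position; everything else starts a new position.

```python
def build_phrase(field, tokens):
    """Render (term, position_increment) pairs as a phrase-query string.

    Simplified model: terms sharing a position are shown in parentheses,
    mirroring the reported album:"(8mile 8) mile" output.
    """
    positions = []
    for term, pos_inc in tokens:
        if pos_inc == 0 and positions:
            positions[-1].append(term)   # overlaps the previous token
        else:
            positions.append([term])     # starts a new position
    parts = [
        terms[0] if len(terms) == 1 else "(" + " ".join(terms) + ")"
        for terms in positions
    ]
    return '{}:"{}"'.format(field, " ".join(parts))


# Assumed token streams for the input "8mile".
wdf_only = [("8mile", 1), ("8", 0), ("mile", 1)]          # WordDelimiterFilter alone
overlap_lost = [("8mile", 1), ("eight", 1), ("mile", 1)]  # increment of 0 dropped
overlap_kept = [("8mile", 1), ("eight", 0), ("mile", 1)]  # increment of 0 preserved

if __name__ == "__main__":
    for stream in (wdf_only, overlap_lost, overlap_kept):
        print(build_phrase("album", stream))
```

Under this model, the second stream reproduces the three-position phrase from the original post, which is consistent with the synonym replacement arriving with an increment of 1 instead of 0. The analysis page would show whether that is what actually happens.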

