Posted to solr-user@lucene.apache.org by Chris Book <ch...@gmail.com> on 2012/11/21 04:55:34 UTC
SynonymFilterFactory breaking WordDelimiterFilterFactory output
Hello, I've recently upgraded from Solr 1.4.1 to 3.6.1 and am running into
a problem with a specific query. When I search for "8mile" or "8-mile"
(without the quotes) and use just the WordDelimiterFilterFactory as
configured below, I get the following query, which is as expected:
album:"(8mile 8) mile"
But when I also add in the SynonymFilterFactory config listed below, I get
this query instead: album:"8mile eight mile". In my test the only contents
of synonyms.txt is 8=>eight. The issue with the 2nd query is the brackets
are removed so it now seems to require all 3 terms as a phrase.
So why does WordDelimiterFilterFactory generate the query I want with both
original and split phrases, but when the number 8 is replaced with eight,
that data is lost and I end up with a phrase that will cause no results to
be found?
This is part of a test case that I believe used to work on 1.4.1, but I
still have to confirm that.
<fieldType name="text_title" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            preserveOriginal="1"
            splitOnCaseChange="1"
            protected="protwords.txt"
    />
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt"
            ignoreCase="true"
            expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            preserveOriginal="1"
            splitOnCaseChange="1"
            protected="protwords.txt"
    />
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt"
            ignoreCase="true"
            expand="true"/>
  </analyzer>
</fieldType>
Note that I have updated my schema version to 1.5 and my luceneMatchVersion
to LUCENE_36.
Thanks,
Chris
Re: SynonymFilterFactory breaking WordDelimiterFilterFactory output
Posted by Erick Erickson <er...@gmail.com>.
Best advice here is to look hard at admin/analysis and see.
But a couple of notes:
1> it's usually unnecessary to include the exact same synonyms in both
query and index time chains. Index-time is preferred.
2> putting lowercasefilter in front of worddelimiterfilter is going to
break wdff _if_ you intend camel-case to produce multiple tokens.
3> brackets? did you mean parentheses? If so I suspect your issue is in
your request handler, not your analysis chain. Perhaps something with
autoGeneratePhraseQueries?
Providing the &debugQuery=true output would help.
Best
Erick
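
Notes 1 and 2 above could be sketched as a query-time chain like the one
below. This is only an illustration built from the filter settings already
posted in this thread; it has not been tested against 3.6.1, and nothing in
the thread confirms that it restores the expected query.

```xml
<!-- Sketch only: query-time analyzer following notes 1 and 2 above. -->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <!-- WordDelimiterFilter now runs before lowercasing, so
       splitOnCaseChange still sees the original case. -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="0"
          catenateNumbers="0"
          catenateAll="0"
          preserveOriginal="1"
          splitOnCaseChange="1"
          protected="protwords.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- SynonymFilterFactory is omitted here on the assumption that
       synonyms are applied at index time only. -->
</analyzer>
```

Two caveats if this is tried. The index-time chain would need the same
filter order so that both sides produce comparable tokens. And with
index-time-only synonyms, an explicit mapping such as 8=>eight replaces 8
in the index, so a query for 8 would no longer match; an equivalence entry
such as "8, eight" with expand="true" keeps both terms indexed.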
Re: SynonymFilterFactory breaking WordDelimiterFilterFactory output
Posted by Yonik Seeley <yo...@lucidworks.com>.
Sounds like perhaps the SynonymFilter is losing the positionIncrement
of 0 (which makes the first two tokens overlap)?
You could perhaps verify with the analysis debugging on the admin page.
-Yonik
http://lucidworks.com