Posted to solr-user@lucene.apache.org by Joe Zhang <sm...@gmail.com> on 2012/12/04 05:10:43 UTC
search behavior on a case-sensitive field
I have a field type defined like this:
<fieldType name="text_cs" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1"/>
    <!-- <filter class="solr.LowerCaseFilterFactory"/> -->
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
When I query "COST", it gives a reasonable number of results (n1).
When I query "CoSt", however, it gives me n2 (> n1) results, and I can't
locate any actual occurrence of "CoSt" in the docs at all. Can anybody advise?
Re: search behavior on a case-sensitive field
Posted by Joe Zhang <sm...@gmail.com>.
haha, makes perfect sense! Thanks a lot!
On Mon, Dec 3, 2012 at 9:25 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
> "CoSt" was split into two terms and the query parser generated an OR of
> them. Adding the autoGeneratePhraseQueries="true" attribute to your
> field type should fix the problem.
>
> You can also change splitOnCaseChange="1" to splitOnCaseChange="0" to
> avoid the term splitting issue.
>
> Be sure to completely reindex in either case.
>
> -- Jack Krupansky
Re: search behavior on a case-sensitive field
Posted by Jack Krupansky <ja...@basetechnology.com>.
"CoSt" was split into two terms and the query parser generated an OR of
them. Adding the autoGeneratePhraseQueries="true" attribute to your field
type should fix the problem.
You can also change splitOnCaseChange="1" to splitOnCaseChange="0" to avoid
the term splitting issue.
Be sure to completely reindex in either case.
-- Jack Krupansky
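
To make the explanation above concrete, here is a minimal Python sketch of what splitOnCaseChange="1" does to the two query strings. It is a simplified model, not Solr's actual implementation: it only splits at lowercase-to-uppercase transitions and ignores catenateWords, stemming and stop words. The function names split_on_case_change and as_or_query are hypothetical and chosen only for illustration.

```python
def split_on_case_change(token):
    """Split a token wherever a lowercase letter is followed by an uppercase one.

    Simplified model of splitOnCaseChange="1"; not Solr's real code.
    """
    parts = []
    start = 0
    for i in range(1, len(token)):
        if token[i - 1].islower() and token[i].isupper():
            parts.append(token[start:i])
            start = i
    parts.append(token[start:])
    return parts


def as_or_query(token):
    """Hypothetical helper: join the sub-terms as an OR of separate terms."""
    return " OR ".join(split_on_case_change(token))


print(split_on_case_change("COST"))  # ['COST']
print(split_on_case_change("CoSt"))  # ['Co', 'St']
print(as_or_query("CoSt"))           # Co OR St
```

"COST" has no lowercase-to-uppercase transition, so it stays a single term. "CoSt" becomes "Co" and "St", and a document containing either one matches the OR query, which is consistent with n2 being larger than n1 even though no document contains "CoSt" itself.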