
Posted to solr-user@lucene.apache.org by cjkadakia <cj...@sonicbids.com> on 2010/03/08 17:05:27 UTC

Wildcard question -- case issue

I'm encountering a potential bug in Solr regarding wildcards. I have two
field types defined thusly:

    <!-- A general unstemmed text field - good if one does not know the
language of the field -->
    <fieldType name="textgen" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>


and 

    <fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
          add enablePositionIncrements=true in both the index and query
          analyzers to leave a 'gap' for more accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords.txt"/>
      </analyzer>
    </fieldType>

When searching with wildcards I get the following behavior.

Two Documents in the index are named "CMJ foo bar" and "CME foo bar"

The name field has been indexed twice as "name" and "namesimple"

query:

spell?q=name:(cm*) OR namesimple:(cm*)

returns:
CMJ foo bar
CME foo bar

spell?q=name:(CM*) OR namesimple:(CM*)
returns
No results.

I added an equivalent synonym for "cmj,CMJ" and re-indexed

spell?q=name:(CM*) OR namesimple:(CM*)
returns
CMJ foo bar

Naturally, I can't see the value or practicality of adding a synonym for
each of these as users report them. From the documentation I've read (as
well as feedback I received on these forums), I've found that stemming can
interfere with wildcards during query and indexing, which is why the
namesimple field is of type "textgen". This solved other wildcard/case
issues, but this one remains.

Any suggestions would be appreciated. Thanks!
-- 
View this message in context: http://old.nabble.com/Wildcard-question----case-issue-tp27823332p27823332.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Wildcard question -- case issue

Posted by cjkadakia <cj...@sonicbids.com>.

Understood. My solution was to convert any search terms with an asterisk to
lowercase prior to submitting to Solr, and it seems to be working correctly
now. Thanks for your help.
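
For anyone landing on this thread later, here is a minimal sketch of that
client-side workaround in Python. The helper name lowercase_wildcard_terms is
hypothetical (it is not part of Solr or any client library), and the sketch
assumes it is applied to the value of the q parameter only. It lowercases just
the terms that contain an asterisk and leaves field names and operators such as
OR untouched.

```python
import re

# Matches a run of term characters containing at least one '*'.
# Whitespace, ':' and parentheses are excluded, so field names
# ("name:") and grouping parentheses are never part of a match.
_WILDCARD_TERM = re.compile(r"[^\s:()]*\*[^\s:()]*")


def lowercase_wildcard_terms(query: str) -> str:
    """Lowercase only the terms in `query` that contain an asterisk.

    Hypothetical helper: a sketch of the workaround described above,
    not an API provided by Solr.
    """
    return _WILDCARD_TERM.sub(lambda m: m.group(0).lower(), query)


if __name__ == "__main__":
    print(lowercase_wildcard_terms("name:(CM*) OR namesimple:(CM*)"))
```

This only helps when the index-time analyzer lowercases terms, as both field
types above do. It does not treat quoted phrases or escaped asterisks
specially.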
-- 
View this message in context: http://old.nabble.com/Wildcard-question----case-issue-tp27823332p27836740.html

Re: Wildcard question -- case issue

Posted by Ahmet Arslan <io...@yahoo.com>.

> query:
> 
> spell?q=name:(cm*) OR namesimple:(cm*)
> 
> returns:
> CMJ foo bar
> CME foo bar
> 
> spell?q=name:(CM*) OR namesimple:(CM*)
> returns
> No results.

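Wildcard queries are not analyzed by Lucene, hence the behavior: the
LowerCaseFilterFactory is never applied to CM*, so it is compared as-is
against the lowercased terms in the index. [1]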
[1]http://www.search-lucene.com/m?id=4A8CE9B2.2070009@ait.co.at||wildcard%20not%20analyzed
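
To illustrate the point, here is a toy simulation in Python. It is not Lucene
code: index_terms and prefix_match are hypothetical stand-ins that reduce
index-time analysis to whitespace splitting plus lowercasing, and reduce a
trailing-asterisk query to a plain prefix comparison with no analysis applied.

```python
def index_terms(text):
    # Stand-in for index-time analysis: whitespace tokenizing plus
    # lowercasing only (stop words, synonyms, stemming are ignored).
    return [token.lower() for token in text.split()]


def prefix_match(prefix, docs):
    # A wildcard term is not analyzed, so the prefix is compared
    # verbatim (case-sensitively) against the indexed terms.
    return [
        doc for doc in docs
        if any(term.startswith(prefix) for term in index_terms(doc))
    ]


docs = ["CMJ foo bar", "CME foo bar"]

if __name__ == "__main__":
    print(prefix_match("cm", docs))  # ['CMJ foo bar', 'CME foo bar']
    print(prefix_match("CM", docs))  # []
```

The indexed terms are "cmj" and "cme", so only the lowercase prefix can match.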