Posted to solr-user@lucene.apache.org by cjkadakia <cj...@sonicbids.com> on 2010/02/23 00:57:30 UTC

Odd wildcard behavior

I'm getting very odd behavior from a wildcard search.

For example, when I'm searching for docs with a name containing the word
"International" the following occur:

q=name:(inte*) -- found "International"
q=name:(intern*) -- found "International"
q=name:(interna*) -- did not find "International"
q=name:(internat*) -- did not find "International"
... adding one character at a time also did not find "International"
q=name:(international*) -- did not find "International"

As indicated, the behavior is quite bizarre and is causing issues with our use
and test cases. Is there something I can set on the "text" fieldType to make
these kinds of searches work? Any insight into why this is not working would
be a big help as well.

Pasted for reference:
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
                synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
          add enablePositionIncrements=true in both the index and query
          analyzers to leave a 'gap' for more accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
                protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
                protected="protwords.txt"/>
      </analyzer>
    </fieldType>

-- 
View this message in context: http://old.nabble.com/Odd-wildcard-behavior-tp27695404p27695404.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Odd wildcard behavior

Posted by Erick Erickson <er...@gmail.com>.

Several things:
> You're including a stemmer in your field, which
transforms your indexed terms.
> Have you used the schema browser on the admin
page to look at the results of indexing with
stemming? Luke is also good for this.
> What shows up when you add "&debugQuery=true"
to your search?
> Wildcard queries aren't analyzed. The underlying
Lucene index just spins through the terms, finding
the ones in the index that match and searching on those.
So your implicit (?) assumption that you'll get the
same behavior at index and query time isn't accurate.
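
To illustrate the last point, here is a minimal Python sketch (not Solr or
Lucene code) of how an unanalyzed prefix is compared against the terms that
actually sit in the index. It assumes the stemmer reduced "International" to
the single indexed term "intern", which is what the results in the original
post suggest. The function name and the term set are made up for the example.

```python
# Hypothetical stand-in for the terms stored in the "name" field after
# index-time analysis (lowercasing + stemming). Assumption: the stemmer
# reduced "International" to "intern".
INDEXED_TERMS = {"intern"}


def prefix_matches(prefix, indexed_terms):
    """Return the indexed terms that start with the raw, unanalyzed prefix."""
    return sorted(t for t in indexed_terms if t.startswith(prefix))


if __name__ == "__main__":
    # The prefix is used exactly as typed; it is never stemmed.
    for prefix in ["inte", "intern", "interna", "international"]:
        hits = prefix_matches(prefix, INDEXED_TERMS)
        print(f"name:({prefix}*) -> {hits}")
```

Under that assumption, any prefix longer than the stemmed term has nothing
left in the index to match, which lines up with the pattern reported in the
original post.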

HTH
Erick

P.S. I'd get a copy of Luke and examine the actual tokens
in the index.


Re: Odd wildcard behavior

Posted by cjkadakia <cj...@sonicbids.com>.

Worked exactly as intended. The name field is now indexed both as "text" and
as the unstemmed "textgen". The submitting code ORs the two fields together
for any name search, and voila: stemming and wildcard searches are both intact.

Thanks!! :)
-- 
View this message in context: http://old.nabble.com/Odd-wildcard-behavior-tp27695404p27704788.html

Re: Odd wildcard behavior

Posted by cjkadakia <cj...@sonicbids.com>.

It helps tremendously, Erick, and it was the exact idea I had last night
after reflecting with a nice scotch. :)

I'm planning on indexing the name field as "text" as I have been, and then
indexing it again as "nameLiteral" or something similar, using a field type
without stemming. The code submitting to Solr will take any "name" search
and turn it into something like q=(name:(search terms*) OR
nameLiteral:(search terms*)) and hopefully, between the two, both
requirements will be preserved.
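
A rough Python sketch of that query rewriting might look like the following.
The function name is hypothetical, "nameLiteral" is only the placeholder field
name from the paragraph above, and the sketch does no escaping of query
syntax characters, so treat it as an outline rather than production code.

```python
def build_name_query(search_terms, stemmed_field="name",
                     literal_field="nameLiteral"):
    """Build a prefix query that ORs the stemmed and unstemmed fields.

    Both field names are assumptions for illustration; no special-character
    escaping is attempted here.
    """
    terms = search_terms.strip()
    if not terms:
        raise ValueError("search_terms must not be empty")
    clause = f"({terms}*)"
    return f"({stemmed_field}:{clause} OR {literal_field}:{clause})"


if __name__ == "__main__":
    print("q=" + build_name_query("internat"))
```

Since wildcard terms are not analyzed and both field types lowercase at index
time, the caller would probably also want to lowercase the input before
building the query.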

I'll reply back with the results regardless. Thanks ahead of time, everyone.

-- 
View this message in context: http://old.nabble.com/Odd-wildcard-behavior-tp27695404p27703992.html

Re: Odd wildcard behavior

Posted by Erick Erickson <er...@gmail.com>.

Well, the first question I'd ask is whether using the stemmers
produces results your *users* don't expect. Your original e-mail
mentions use and test cases. Can you re-visit the use case
and determine whether your users are better served by working
with the stemming? If so, then it's just a matter of changing your
tests.

But if your use case is required, you could think about copying
the field to a field without the stemmer, and searching on each
field as required. That requires some way to determine what
search clauses should go against what field, which may be
trivial. Or not....
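
In the trivial case, deciding which clause goes against which field could be
as simple as checking each term for a wildcard character. The Python sketch
below assumes a stemmed field called "name" and an unstemmed copy called
"nameLiteral" (both names are placeholders), and it ignores phrases, boolean
operators and escaping entirely.

```python
def route_clause(term, stemmed_field="name", literal_field="nameLiteral"):
    """Send wildcard terms to the unstemmed field, all others to the stemmed one."""
    has_wildcard = "*" in term or "?" in term
    field = literal_field if has_wildcard else stemmed_field
    return f"{field}:{term}"


def route_query(user_input):
    """Route each whitespace-separated term and join the resulting clauses."""
    return " ".join(route_clause(t) for t in user_input.split())


if __name__ == "__main__":
    print(route_query("foos internat*"))
```

Anything beyond single terms (quoted phrases, nested parentheses, explicit
field prefixes typed by the user) is where the "Or not" part comes in.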

HTH
Erick

On Mon, Feb 22, 2010 at 8:08 PM, cjkadakia <cj...@sonicbids.com> wrote:

>
> If stemming is the underlying issue here, then are there any suggestions?
> Would I have to remove the SnowballPorterFilterFactory from both the index
> AND the query?
>
> Just to clarify, the ability to search on "foos" and return "foo" (and
> vice-versa) is quite important, but this other issue with wildcards is a
> more pressing issue for right now. What would you suggest to handle both
> requirements?

Re: Odd wildcard behavior

Posted by cjkadakia <cj...@sonicbids.com>.

If stemming is the underlying issue here, then are there any suggestions?
Would I have to remove the SnowballPorterFilterFactory from both the index
AND the query?

Just to clarify, the ability to search on "foos" and return "foo" (and
vice-versa) is quite important, but this other issue with wildcards is a
more pressing issue for right now. What would you suggest to handle both
requirements?

As for the output from the debugQuery, here you go:

<lst name="debug">
  <str name="rawquerystring">(name:(international*))</str>
  <str name="querystring">(name:(international*))</str>
  <str name="parsedquery">name:international*</str>
  <str name="parsedquery_toString">name:international*</str>
  <lst name="explain"/>
  <str name="QParser">LuceneQParser</str>
  <lst name="timing">
    <double name="time">0.0</double>
    <lst name="prepare">
      <double name="time">0.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.FacetComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.StatsComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.SpellCheckComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.DebugComponent">
        <double name="time">0.0</double>
      </lst>
    </lst>
    <lst name="process">
      <double name="time">0.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.FacetComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.StatsComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.SpellCheckComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.DebugComponent">
        <double name="time">0.0</double>
      </lst>
    </lst>
  </lst>
</lst>



-- 
View this message in context: http://old.nabble.com/Odd-wildcard-behavior-tp27695404p27697228.html

Re: Odd wildcard behavior

Posted by Robert Muir <rc...@gmail.com>.

The Porter stemmer turns 'international' into 'intern'.



-- 
Robert Muir
rcmuir@gmail.com