Posted to solr-user@lucene.apache.org by Charles Hornberger <ch...@gmail.com> on 2007/11/28 18:42:43 UTC

query parsing & wildcards

I'm confused by some behavior I'm seeing in Solr (I'm using 1.2.0). I
have a field named "description", declared with the following
fieldType:

    <fieldType name="textTightUnstemmed" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="0" generateNumberParts="0" catenateWords="1"
                catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

The problem I'm having is that when I search for description:deck*, I
get the results I expect; when I search for description:Deck*, I get
nothing. I want both queries to return the same result set. (I'm using
the standard request handler.)

Interestingly, when I search for description:Deck from the web
interface, the debug output shows that the query term is converted to
lowercase:

<str name="rawquerystring">description:Deck</str>
<str name="querystring">description:Deck</str>
<str name="parsedquery">description:deck</str>
<str name="parsedquery_toString">description:deck</str>

... but when I search for description:Deck*, it shows that it is not:

<str name="rawquerystring">description:Deck*</str>
<str name="querystring">description:Deck*</str>
<str name="parsedquery">description:Deck*</str>
<str name="parsedquery_toString">description:Deck*</str>

What am I doing wrong here?

Also, when I use the Field Analysis tool for description:Deck*, it
shows the following (sorry for the bad copy/paste):

Query Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position 	1
term text 	Deck*
term type 	word
source start,end 	0,5
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
expand=false, ignoreCase=true}
term position 	1
term text 	Deck*
term type 	word
source start,end 	0,5
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
ignoreCase=true}
term position 	1
term text 	Deck*
term type 	word
source start,end 	0,5
org.apache.solr.analysis.WordDelimiterFilterFactory
{generateNumberParts=0, catenateWords=1, generateWordParts=0,
catenateAll=0, catenateNumbers=1}
term position 	1
term text 	Deck
term type 	word
source start,end 	0,4
org.apache.solr.analysis.LowerCaseFilterFactory {}
term position 	1
term text 	deck
term type 	word
source start,end 	0,4
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
term position 	1
term text 	deck
term type 	word
source start,end 	0,4

Thanks,
Charlie

Re: query parsing & wildcards

Posted by Chris Hostetter <ho...@fucit.org>.

: I should have Googled better. It seems that my question has been asked
: and answered already, and not just once:

right, wildcard and prefix queries aren't analyzed by the query 
parser (there's more on the "why" of this in the Lucene-Java FAQ).
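Since the parser passes the prefix through untouched, one common workaround is to lowercase it on the client before building the query. A minimal sketch in Python, assuming the field's analyzer lowercases at index time (as the fieldType above does) and that the prefix contains no query-syntax characters; prefix_query is a hypothetical helper, not part of Solr:

```python
def prefix_query(field: str, prefix: str) -> str:
    """Build a prefix query whose term is lowercased client-side.

    Hypothetical helper: wildcard/prefix terms bypass the field's
    analyzer, so the LowerCaseFilterFactory never sees them. Lowercasing
    here mimics only that one filter, not the rest of the chain.
    """
    return f"{field}:{prefix.lower()}*"


# Both spellings now produce the same query string.
print(prefix_query("description", "Deck"))
print(prefix_query("description", "deck"))
```

This only mirrors the lowercasing step; synonyms, stop words and word-delimiter handling are still not applied to the prefix.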

To clarify one other part of your question....

: > Also, when I use the Field Analysis tool for description:Deck*, it
: > shows the following (sorry for the bad copy/paste):

the analysis tool only shows you the "analysis" portion of
indexing/querying ... it knows nothing about which query parser you are
using, so it doesn't know anything about any special query parser
characters (like "*").  The output it gave you shows you what the
standard request handler would have done if you'd used it to search
for...
         description:"Deck*"
or:      description:Deck\*

(where the * character is 'escaped')
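If you do want a literal "*" to reach the analyzer, the escaping can be done programmatically. A rough sketch in Python, assuming the special-character list from the Lucene query parser syntax documentation; escape_term is a hypothetical helper, not a Solr API:

```python
import re

# Characters (and the two-character operators && and ||) that the Lucene
# query parser syntax treats as special. Assumption: this list follows the
# Lucene query syntax docs; check it against the version you run.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\]|&&|\|\|)')


def escape_term(term: str) -> str:
    """Backslash-escape query-syntax characters so the term is taken literally."""
    return _SPECIAL.sub(r'\\\1', term)


# The escaped term is parsed as an ordinary term and therefore analyzed.
print("description:" + escape_term("Deck*"))
```

With the fieldType above, the escaped term then goes through the normal analysis chain, which is the path the analysis tool output reflects.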



-Hoss

Re: query parsing & wildcards

Posted by Charles Hornberger <ch...@gmail.com>.

I should have Googled better. It seems that my question has been asked
and answered already, and not just once:

  http://www.nabble.com/Using-wildcard-with-accented-words-tf4673239.html
  http://groups.google.com/group/acts_as_solr/browse_thread/thread/42920dc2dcc5fa88
