Posted to solr-user@lucene.apache.org by Cesar Ortiz <ce...@gmail.com> on 2012/04/25 18:26:40 UTC

Why a query with a hyphen is not returning a document?

Hi,

I have a problem

I don't understand why the query 'spider-man' returns no documents from
Solr, whereas the query 'spider man' does. The analysis tool highlights
the terms spider and man, so I don't know what is going on. Is the
analyzer showing wrong information?

If I execute the query
http://localhost:8983/solr/select/?q=name%3Aspider-man&version=2.2&start=0&rows=10&indent=on&debugQuery=true
I get:

<str name="rawquerystring">name:spider-man</str>
<str name="querystring">name:spider-man</str>
<str name="parsedquery">PhraseQuery(name:"spider man")</str>
<str name="parsedquery_toString">name:"spider man"</str>
<lst name="explain"/>
<str name="QParser">LuceneQParser</str>

whereas if I execute
http://localhost:8983/solr/select/?q=name%3Aspider+name%3Aman&version=2.2&start=0&rows=10&indent=on&debugQuery=true
I get:


<str name="rawquerystring">name:spider name:man</str>
<str name="querystring">name:spider name:man</str>
<str name="parsedquery">+name:spider +name:man</str>
<str name="parsedquery_toString">+name:spider +name:man</str>
<lst name="explain">
<str name="movie-spider-man-2002">
3.0460029 = (MATCH) sum of:
  2.123142 = (MATCH) weight(name:spider in 326), product of:
    0.8348806 = queryWeight(name:spider), product of:
      8.137755 = idf(docFreq=4, maxDocs=6293)
      0.102593474 = queryNorm
    2.5430486 = (MATCH) fieldWeight(name:spider in 326), product of:
      1.0 = tf(termFreq(name:spider)=1)
      8.137755 = idf(docFreq=4, maxDocs=6293)
      0.3125 = fieldNorm(field=name, doc=326)
  0.92286074 = (MATCH) weight(name:man in 326), product of:
    0.5504311 = queryWeight(name:man), product of:
      5.3651667 = idf(docFreq=79, maxDocs=6293)
      0.102593474 = queryNorm
    1.6766145 = (MATCH) fieldWeight(name:man in 326), product of:
      1.0 = tf(termFreq(name:man)=1)
      5.3651667 = idf(docFreq=79, maxDocs=6293)
      0.3125 = fieldNorm(field=name, doc=326)
</str>
<str name="movie-spider-man-2-2004-2004">
2.4368021 = (MATCH) sum of:
  1.6985135 = (MATCH) weight(name:spider in 46), product of:
    0.8348806 = queryWeight(name:spider), product of:
      8.137755 = idf(docFreq=4, maxDocs=6293)
      0.102593474 = queryNorm
    2.0344388 = (MATCH) fieldWeight(name:spider in 46), product of:
      1.0 = tf(termFreq(name:spider)=1)
      8.137755 = idf(docFreq=4, maxDocs=6293)
      0.25 = fieldNorm(field=name, doc=46)
  0.7382886 = (MATCH) weight(name:man in 46), product of:
    0.5504311 = queryWeight(name:man), product of:
      5.3651667 = idf(docFreq=79, maxDocs=6293)
      0.102593474 = queryNorm
    1.3412917 = (MATCH) fieldWeight(name:man in 46), product of:
      1.0 = tf(termFreq(name:man)=1)
      5.3651667 = idf(docFreq=79, maxDocs=6293)
      0.25 = fieldNorm(field=name, doc=46)
</str>
</lst>

I am going to strip the hyphens on the client side before submitting the
query, but I would like to understand what is going on...
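The client-side cleanup mentioned above could look roughly like the sketch below. It is only an illustration: the helper names `split_hyphens` and `build_params` are made up for this example, and it assumes the only goal is to turn a hyphenated word into separate fielded clauses, as in the second query of this message. It does not escape any other query-syntax characters.

```python
import re
from urllib.parse import urlencode


def split_hyphens(user_text):
    """Replace a hyphen that sits between two word characters with a space."""
    return re.sub(r"(?<=\w)-(?=\w)", " ", user_text)


def build_params(field, user_text):
    """Build request parameters with one fielded clause per word.

    Hypothetical helper: other special characters are NOT escaped here.
    """
    words = split_hyphens(user_text).split()
    q = " ".join(f"{field}:{word}" for word in words)
    return {"q": q, "debugQuery": "true"}


params = build_params("name", "spider-man")
query_string = urlencode(params)  # percent-encodes ':' and turns spaces into '+'
print(params["q"])
print(query_string)
```

The resulting `q` value is the same `name:spider name:man` form that returns documents in the second request above.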

Thanks,

-- César

Index definition
============

<fieldType name="uzngramtext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
            maxGramSize="55" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Output of the analysis tool
===================

Index Analyzer
org.apache.solr.analysis.StandardTokenizerFactory
{luceneMatchVersion=LUCENE_31}
position 1 2
term text spider man
startOffset 0 7
endOffset 6 10
type <ALPHANUM> <ALPHANUM>
org.apache.solr.analysis.StandardFilterFactory
{luceneMatchVersion=LUCENE_31}
position 1 2
term text spider man
type <ALPHANUM> <ALPHANUM>
startOffset 0 7
endOffset 6 10
org.apache.solr.analysis.ASCIIFoldingFilterFactory
{luceneMatchVersion=LUCENE_31}
position 1 2
term text spider man
type <ALPHANUM> <ALPHANUM>
startOffset 0 7
endOffset 6 10
org.apache.solr.analysis.LowerCaseFilterFactory
{luceneMatchVersion=LUCENE_31}
position 1 2
term text spider man
type <ALPHANUM> <ALPHANUM>
startOffset 0 7
endOffset 6 10
org.apache.solr.analysis.EdgeNGramFilterFactory {maxGramSize=55,
side=front, minGramSize=1, luceneMatchVersion=LUCENE_31}
position 1 2 3 4 5 6 7 8 9
term text s sp spi spid spide spider m ma man
startOffset 0 0 0 0 0 0 7 7 7
endOffset 1 2 3 4 5 6 8 9 10
type word word word word word word word word word

Query Analyzer
org.apache.solr.analysis.StandardTokenizerFactory
{luceneMatchVersion=LUCENE_31}
position 1 2
term text spider man
startOffset 0 7
endOffset 6 10
type <ALPHANUM> <ALPHANUM>
org.apache.solr.analysis.StandardFilterFactory
{luceneMatchVersion=LUCENE_31}
position 1 2
term text spider man
type <ALPHANUM> <ALPHANUM>
startOffset 0 7
endOffset 6 10
org.apache.solr.analysis.ASCIIFoldingFilterFactory
{luceneMatchVersion=LUCENE_31}
position 1 2
term text spider man
type <ALPHANUM> <ALPHANUM>
startOffset 0 7
endOffset 6 10
org.apache.solr.analysis.LowerCaseFilterFactory
{luceneMatchVersion=LUCENE_31}
position 1 2
term text spider man
type <ALPHANUM> <ALPHANUM>
startOffset 0 7
endOffset 6 10

Re: Why a query with a hyphen is not returning a document?

Posted by Erick Erickson <er...@gmail.com>.

I see what, but I'm not sure why.

name:spider-man winds up being a phrase query, so if "spider" and "man" don't
appear next to each other you won't match. And your indexing process is
messing with the term positions when it injects the grams in your index,
so spider is in position 6 and man is in position 9.

+name:spider +name:man simply requires that both terms appear in the
field; no proximity is implied.
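To make the position argument concrete, here is a small toy model in Python. It is not Solr or Lucene code: it simply assumes, as the analysis output in this thread shows, that every edge n-gram is written at its own position, and it treats a phrase match as "each term one position after the previous one". All function names are invented for the example.

```python
def edge_ngrams(term, min_gram=1, max_gram=55):
    """Front edge n-grams of a term, e.g. 'man' -> ['m', 'ma', 'man']."""
    return [term[:n] for n in range(min_gram, min(len(term), max_gram) + 1)]


def index_positions(tokens):
    """Map each gram to its positions, giving every gram a new position
    (the behaviour visible in the analysis output quoted in this thread)."""
    positions = {}
    pos = 0
    for token in tokens:
        for gram in edge_ngrams(token):
            pos += 1
            positions.setdefault(gram, []).append(pos)
    return positions


def phrase_matches(positions, phrase):
    """Exact phrase: every term must directly follow the previous one."""
    first, *rest = phrase
    for start in positions.get(first, []):
        if all(start + i + 1 in positions.get(term, [])
               for i, term in enumerate(rest)):
            return True
    return False


def all_terms_match(positions, terms):
    """Boolean AND: every term must occur somewhere, positions ignored."""
    return all(term in positions for term in terms)


index = index_positions(["spider", "man"])
print(index["spider"], index["man"])
print(phrase_matches(index, ["spider", "man"]))
print(all_terms_match(index, ["spider", "man"]))
```

In this model "spider" lands at position 6 and "man" at position 9, so the phrase check fails while the plain AND of the two terms succeeds, which mirrors the two results in the original message.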

Why spider-man is generating a phrase query isn't clear to me.....

BTW, gramming on only the index side is highly questionable, you might
want to examine your index (admin/schema browser) to see the
effects of this...

A common issue with using the analysis page is that it only shows you what
happens _after_ the query parser has done its tricks. In your case, for
instance, I think the issue is the phrase-query generation, which happens
in the parser and so never shows up on that page.
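The split between "parser" and "analyzer" can be sketched with a toy parser in Python. This is a simplified model under stated assumptions, not the Lucene query parser: it splits the query on whitespace, runs each chunk through a stand-in analyzer, and turns a chunk that yields several tokens into a phrase when the hypothetical `auto_phrase` flag is on. The leading `+` on multiple clauses just imitates the AND default visible in the debug output of this thread. Solr 3.1 and later have, as far as I know, a comparable switch in the `autoGeneratePhraseQueries` attribute on the field type, which may be worth checking in the schema.

```python
import re


def analyze(text):
    """Stand-in for the query analyzer: split on non-alphanumerics, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def parse(query, auto_phrase=True):
    """Toy parser: whitespace split first, analysis per chunk second."""
    clauses = []
    for chunk in query.split():
        field, _, text = chunk.partition(":")
        tokens = analyze(text)
        if len(tokens) > 1 and auto_phrase:
            # Several tokens from one chunk become a single phrase clause.
            phrase = " ".join(tokens)
            clauses.append(f'{field}:"{phrase}"')
        else:
            clauses.extend(f"{field}:{token}" for token in tokens)
    if len(clauses) == 1:
        return clauses[0]
    return " ".join("+" + clause for clause in clauses)


print(parse("name:spider-man"))
print(parse("name:spider name:man"))
print(parse("name:spider-man", auto_phrase=False))
```

The analyzer sees "spider" and "man" in both cases, which is all the analysis page reports; whether they end up as a phrase or as two independent clauses is decided one step earlier, by how the parser chunks the input.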
