Posted to solr-user@lucene.apache.org by "mattia.martinello@gmail.com" <ma...@gmail.com> on 2012/08/27 15:19:17 UTC

Understanding SOLR search results

Hi.
I get some strange results for one query from SOLR.

This is an example query:

<str name="q">
(titolo:trenti OR sommario:trenti OR occhiello:trenti OR testo:trenti)
</str>

In the results I have this document:

<result name="response" numFound="1" start="0" maxScore="6.5818048">
<doc>
<float name="score">6.5818048</float>
<str name="id">503af94e0c342</str>
<str name="occhiello">IL PROGETTO..........  (no word "tren" in
"occhiello" field) </str>
<str name="sommario"/>
<str name="sottotitolo">
C'è la concessione edilizia. Gli islamici...
</str> (no word "tren" in "sottotitolo" field).
<str name="testo">
La fine del ramadan, pochi giorni fa.......
</str> (no word "tren" in "testo" field).
<str name="titolo">Moschea in viale.....</str> (no word "tren" in
"titolo" field).
</doc>
</result>

This document does not have the word "tren" in any of the fields
titolo, occhiello or testo, but:

<lst name="highlighting">
<lst name="503af94e0c342"/>
</lst>

So, as far as I can see, this document was selected because of the "id"
field. This is the debug output:

<lst name="explain">
<str name="503af94e0c342">
6.5818048 = (MATCH) sum of:
  6.3718405 = (MATCH) weight(id:503af94e0c342 in 48107), product of:
    0.57440555 = queryWeight(id:503af94e0c342), product of:
      11.09293 = idf(docFreq=1, maxDocs=48343)
      0.05178123 = queryNorm
    11.09293 = (MATCH) fieldWeight(id:503af94e0c342 in 48107), product of:
      1.0 = tf(termFreq(id:503af94e0c342)=1)
      11.09293 = idf(docFreq=1, maxDocs=48343)
      1.0 = fieldNorm(field=id, doc=48107)
  0.20996419 = (MATCH) product of:
    0.83985674 = (MATCH) sum of:
      0.83985674 = (MATCH) weight(titolo:trent in 48107), product of:
        0.34054396 = queryWeight(titolo:trent), product of:
          6.5765905 = idf(docFreq=182, maxDocs=48343)
          0.05178123 = queryNorm
        2.4662213 = (MATCH) fieldWeight(titolo:trent in 48107), product of:
          1.0 = tf(termFreq(titolo:trent)=1)
          6.5765905 = idf(docFreq=182, maxDocs=48343)
          0.375 = fieldNorm(field=titolo, doc=48107)
    0.25 = coord(1/4)
</str>
</lst>

The field "id" is an indexed string field:

<field name="id" type="string" indexed="true" stored="true" required="true" />

Could you please help me understand this behavior?

Thank you very much!
Bye,
Mattia.

Re: Understanding SOLR search results

Posted by Jack Krupansky <ja...@basetechnology.com>.

If the stemming is too aggressive, just remove that token filter from the
field type. You will have to re-index after a change like that, though,
because the terms in the index will be different.
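
For illustration only, a stemmer-free variant of an Italian text field type
could look roughly like the sketch below. The name "text_it_nostem" and the
exact filter chain are assumptions, since I have not seen your schema; it is
simply a minimal chain with the stem filter left out.

```xml
<!-- Hypothetical field type; the name and chain are illustrative, not from the poster's schema. -->
<fieldType name="text_it_nostem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- No solr.ItalianLightStemFilterFactory here, so "trento" is indexed as "trento". -->
  </analyzer>
</fieldType>
```

With a chain like this, "trenti" and "trento" stay distinct terms, at the cost
of no longer matching singular and plural forms of the same word.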

-- Jack Krupansky

Re: Understanding SOLR search results

Posted by "mattia.martinello@gmail.com" <ma...@gmail.com>.

2012/8/27 Mike Schultz <mi...@gmail.com>:
> Can you include the entire text for only the titolo field?

The entire text for the titolo field is "Moschea in viale Trento,
partono i lavori".

I tried to change the type of the titolo field from text to textgen,
and now it does not match.

I think it is a stemming problem, but I cannot use
"KeywordMarkerFilter" for every wrongly stemmed word, because I cannot
predict how many there are.

Re: Understanding SOLR search results

Posted by Mike Schultz <mi...@gmail.com>.

Can you include the entire text for only the titolo field?  

1.0 = tf(termFreq(titolo:trent)=1) means the index contains one occurrence
of the term 'trent' in that field of that document.

Mike

Re: Understanding SOLR search results

Posted by Jack Krupansky <ja...@basetechnology.com>.

I surmise that you are using the "text_it" field type, or something similar. 
It has:

<filter class="solr.ItalianLightStemFilterFactory"/>

When I enter "trento" into the Solr admin analysis page that last filter 
transforms "trento" into "trent", just as we see in the query explain.

So, indeed, this looks like a stemming anomaly.

I see this comment in the code: "To prevent terms from being stemmed use an 
instance of KeywordMarkerFilter", so you could use 
"solr.KeywordMarkerFilterFactory" and create a protected-words list text 
file.
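
As a rough sketch of that suggestion, the field type might look something like
the following. The field type name "text_it_protected", the tokenizer choice
and the file name "protwords.txt" are assumptions for illustration, not taken
from your schema. The important point is that the keyword marker has to come
before the stem filter in the chain.

```xml
<!-- Hypothetical field type; adapt the name and the rest of the chain to your schema. -->
<fieldType name="text_it_protected" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Terms listed in protwords.txt are marked as keywords and skipped by the stemmer. -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.ItalianLightStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The protected-words file would then hold one term per line, for example
"trento" (lowercase, since the marker runs after the lowercase filter in this
sketch). As with any analysis change, a re-index is needed afterwards.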

-- Jack Krupansky

Re: Understanding SOLR search results

Posted by "mattia.martinello@gmail.com" <ma...@gmail.com>.

> Maybe you have a synonym in the title field? Or maybe some stemming anomaly?

The complete title is "Moschea in viale Trento, partono i lavori", so
"Trent" is a substring of the word "Trento".
But if I search for "Mos" or "lavo", I do not get this result, so I
don't understand why "Trent" is matched as a partial word while "Mos"
and "lavo" are not.

Do you have any ideas?

Re: Understanding SOLR search results

Posted by Jack Krupansky <ja...@basetechnology.com>.

Maybe you have a synonym in the title field? Or maybe some stemming anomaly?

Try using the Solr admin analyzer and enter the query text for the title 
field and see how it analyzes.

In any case, the explain is clearly saying that titolo:trent was a match. 
Regardless of what source text you gave for that field, one of the terms was 
filtered to be "trent".

-- Jack Krupansky
