Posted to solr-user@lucene.apache.org by "O. Olson" <ol...@yahoo.it> on 2014/07/25 00:45:20 UTC

Understanding the Debug explanations for Query Result Scoring/Ranking

Hi,

	If you add /*&debug=true*/ to the Solr request /(and &wt=xml if your
current output is not XML)/, you get a node in the resulting XML that
is named "debug". It has a child node called "explain", which contains a
list showing why the results are ranked in a particular order.
I'm curious whether there is any documentation on understanding these
numbers/results. 

	I am new to Solr, so I apologize if I am using the wrong terms to
describe my problem. I am also aware of
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
though I have not completely understood it. 

	My problem is trying to understand something like this: 

1.5797625 = (MATCH) sum of: 0.4717142 = (MATCH) weight(text:televis in
44109) [DefaultSimilarity], result of: 0.4717142 = score(doc=44109,freq=1.0
= termFreq=1.0 ), product of: 0.71447384 = queryWeight, product of:
7.0424104 = idf(docFreq=896, maxDocs=377553) 0.10145303 = queryNorm 0.660226
= fieldWeight in 44109, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 =
termFreq=1.0 7.0424104 = idf(docFreq=896, maxDocs=377553) 0.09375 =
fieldNorm(doc=44109) 1.1080483 = (MATCH) weight(text:tv in 44109)
[DefaultSimilarity], result of: 1.1080483 = score(doc=44109,freq=6.0 =
termFreq=6.0 ), product of: 0.6996622 = queryWeight, product of: 6.896415 =
idf(docFreq=1037, maxDocs=377553) 0.10145303 = queryNorm 1.5836904 =
fieldWeight in 44109, product of: 2.4494898 = tf(freq=6.0), with freq of:
6.0 = termFreq=6.0 6.896415 = idf(docFreq=1037, maxDocs=377553) 0.09375 =
fieldNorm(doc=44109)

*Note:* I have searched for "televisions". My search field is a single
catch-all field. The Edismax parser seems to break up my search term into
"televis" and "tv"

Is there some documentation on how to understand these numbers? They do not
seem to be properly delimited. At a minimum, I can understand something
like: 
1.5797625 =  0.4717142 + 1.1080483
and
0.71447384  = 7.0424104 * 0.10145303

But I cannot tell whether something like "0.10145303 = queryNorm 0.660226
= fieldWeight in 44109" is used in the calculation anywhere. Also, since
there were only two terms /("televis" and "tv")/, I could use subtraction to
find out that 1.1080483 was the start of a new result.

I'd also appreciate it if someone could tell me which class dumps out the
above data. If I know it, I can edit that class to make the output a bit more
understandable for me.

Thank you,
O. O.






--
View this message in context: http://lucene.472066.n3.nabble.com/Understanding-the-Debug-explanations-for-Query-Result-Scoring-Ranking-tp4149137.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Understanding the Debug explanations for Query Result Scoring/Ranking

Posted by Jack Krupansky <ja...@basetechnology.com>.

The formatting is one thing, but ultimately it is just a giant expression, 
one for each document. The expression is computing the score, based on your 
chosen or default "similarity" algorithm. All the terms in the expressions 
are detailed here:

http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

Unless you dive into that math (not so bad, really, if you are motivated), 
the expressions are going to be rather opaque to you.

The long floating point numbers are mostly just the intermediate (and final) 
calculations of the math described above.

Try constructing a very simple collection of simple, contrived documents, 
like a short sentence in each, with some common terms, and then try simple 
queries to see how the expression term values change. Try computing TF, DF, 
IDF yourself (just count the terms by hand), and compare to what debug gives 
you.
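
As a rough illustration of that exercise, the sketch below recomputes the
numbers from the explain output quoted in this thread. It assumes the
DefaultSimilarity formulas described on the TFIDFSimilarity page linked above
(tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))). The function names
are made up for this example and are not part of Solr or Lucene. Lucene works
with 32-bit floats, so the results should agree with the debug output only to
about four decimal places.

```python
import math

# Values copied from the explain output in this thread.
QUERY_NORM = 0.10145303   # "queryNorm"
FIELD_NORM = 0.09375      # "fieldNorm(doc=44109)"
MAX_DOCS = 377553         # "maxDocs"


def tf(freq):
    """Term frequency factor: square root of the raw term count."""
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    """Score of one term: queryWeight * fieldWeight, as in the explain tree."""
    query_weight = idf(doc_freq, max_docs) * query_norm
    field_weight = tf(freq) * idf(doc_freq, max_docs) * field_norm
    return query_weight * field_weight


# One sub-score per query term, then the "sum of:" at the top of the tree.
televis = term_score(1.0, 896, MAX_DOCS, QUERY_NORM, FIELD_NORM)
tv = term_score(6.0, 1037, MAX_DOCS, QUERY_NORM, FIELD_NORM)
total = televis + tv

print("text:televis", televis)
print("text:tv     ", tv)
print("sum         ", total)
```

Read this way, queryNorm feeds into queryWeight and fieldWeight is the second
factor of each term's score, so every number in the flattened output takes
part in the calculation.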

-- Jack Krupansky


Re: Understanding the Debug explanations for Query Result Scoring/Ranking

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

Hi,

In addition, this might be useful:

Fundamentals of Information Retrieval, Illustration with Apache Lucene
https://www.youtube.com/watch?v=SCsS5ePGmCs

This video is about 40 minutes long, but you can fast forward to 24:00
to learn about scoring based on the vector space model and how Lucene
customizes it.

Koji
-- 
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

(2014/07/25 8:00), Uwe Reh wrote:
> Hi,
>
> to get an idea of the meaning of all these numbers, have a look at http://explain.solr.pl. I like
> this tool, it's great.
>
> Uwe

Re: Understanding the Debug explanations for Query Result Scoring/Ranking

Posted by "O. Olson" <ol...@yahoo.it>.

Thank you very much Chris. I was not aware of debug.explain.structured. It
seems to be what I was looking for. 

Thanks also to Jack Krupansky. Yes, delving into those numbers would be my
next step, but I will get to that later.
O. O.


Chris Hostetter-3 wrote
> Just to be clear, regardless of *which* response writer you use (xml, 
> ruby, json, etc...) the default behavior is to include the score 
> explanation as a single string which uses tabs/newlines to deal with the 
> nesting (this nesting is visible if you view the raw response, no matter 
> which ResponseWriter you use).
> 
> You can however add a param indicating that you want the explanation 
> information to be returned as *structured data* instead of a simple 
> string...
> 
> https://wiki.apache.org/solr/CommonQueryParameters#debug.explain.structured
> 
> ...if you want to programmatically process debug info, this is the 
> recommended way to do so.
> 
> -Hoss
> http://www.lucidworks.com/






Re: Understanding the Debug explanations for Query Result Scoring/Ranking

Posted by Chris Hostetter <ho...@fucit.org>.

: Thank you very much Erik. This is exactly what I was looking for. While at
: the moment I have no clue about these numbers, the ruby formatting makes it
: much easier to understand.

Just to be clear, regardless of *which* response writer you use (xml, 
ruby, json, etc...) the default behavior is to include the score 
explanation as a single string which uses tabs/newlines to deal with the 
nesting (this nesting is visible if you view the raw response, no matter 
which ResponseWriter you use).

You can however add a param indicating that you want the explanation 
information to be returned as *structured data* instead of a simple 
string...

https://wiki.apache.org/solr/CommonQueryParameters#debug.explain.structured

...if you want to programmatically process debug info, this is the 
recommended way to do so.
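
A minimal sketch of such processing, assuming each structured explanation
node is a map with "value", "description" and an optional "details" list of
child nodes. The sample data below is abbreviated by hand from the numbers
earlier in this thread and is not a real Solr response; format_explain is a
name made up for this example.

```python
def format_explain(node, depth=0):
    """Return the explanation tree as a list of indented text lines."""
    indent = "  " * depth
    lines = ["{}{} = {}".format(indent, node["value"], node["description"])]
    # Each child explains one factor or summand of its parent.
    for child in node.get("details", []):
        lines.extend(format_explain(child, depth + 1))
    return lines


# Hand-written sample, abbreviated from the output quoted in this thread.
sample = {
    "match": True,
    "value": 1.5797625,
    "description": "sum of:",
    "details": [
        {
            "match": True,
            "value": 0.4717142,
            "description": "weight(text:televis in 44109), product of:",
            "details": [
                {"match": True, "value": 0.71447384, "description": "queryWeight"},
                {"match": True, "value": 0.660226, "description": "fieldWeight in 44109"},
            ],
        },
        {
            "match": True,
            "value": 1.1080483,
            "description": "weight(text:tv in 44109), product of:",
            "details": [
                {"match": True, "value": 0.6996622, "description": "queryWeight"},
                {"match": True, "value": 1.5836904, "description": "fieldWeight in 44109"},
            ],
        },
    ],
}

lines = format_explain(sample)
print("\n".join(lines))
```

With the nesting available as data, the question of where one term's result
ends and the next begins no longer depends on whitespace in the response.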

-Hoss
http://www.lucidworks.com/

Re: Understanding the Debug explanations for Query Result Scoring/Ranking

Posted by "O. Olson" <ol...@yahoo.it>.

Thank you very much Erik. This is exactly what I was looking for. While at
the moment I have no clue about these numbers, the ruby formatting makes it
much easier to understand.

Thanks to you Koji. I'm sorry I did not acknowledge you before. I think
Erik's solution is what I was looking for.
O. O.



Erik Hatcher-4 wrote
> The format of the XML explain output is not indented or very readable. 
> When I really need to see the explain indented, I use wt=ruby&indent=true
> (I don’t think the indent parameter is relevant for the explain output,
> but I use it anyway)
> 
> 	Erik






Re: Understanding the Debug explanations for Query Result Scoring/Ranking

Posted by Erik Hatcher <er...@gmail.com>.

The format of the XML explain output is not indented or very readable.  When I really need to see the explain indented, I use wt=ruby&indent=true (I don’t think the indent parameter is relevant for the explain output, but I use it anyway)
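
For illustration only, here is one way to build such a request URL. The host,
port and core name ("localhost:8983", "collection1") are placeholders and not
taken from this thread; build_debug_url is a name made up for this example.
The parameters themselves (q, debug, wt, indent) are the ones discussed above.

```python
from urllib.parse import urlencode


def build_debug_url(base_url, query):
    """Build a select URL that asks for debug output in ruby format."""
    params = {
        "q": query,
        "debug": "true",    # adds the "debug" section, including "explain"
        "wt": "ruby",       # ruby response writer, easier to read than XML
        "indent": "true",   # probably not needed for explain, harmless
    }
    return base_url + "?" + urlencode(params)


# Placeholder base URL: adjust host, port and core name to your setup.
url = build_debug_url("http://localhost:8983/solr/collection1/select", "televisions")
print(url)
```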

	Erik

On Jul 25, 2014, at 10:11 AM, O. Olson <ol...@yahoo.it> wrote:

> Thank you Uwe. Unfortunately, I could not get your explain solr website to
> work. I always get an error saying "Ops. We have internal server error. This
> event was logged. We will try fix this soon. We are sorry for
> inconvenience."
> 
> At this point, I know that I need some technical background to
> understand how these numbers are calculated. However, even with that, I am
> sure that the format of this output is not obvious. I am curious about the
> documentation of this output format. It seems to be unintelligible. 
> 
> If this is not documented anywhere, can someone point me to the class that
> produces this output?
> 
> Thank you,
> O. O.
> 
> 

Re: Understanding the Debug explanations for Query Result Scoring/Ranking

Posted by "O. Olson" <ol...@yahoo.it>.

Thank you Uwe. Unfortunately, I could not get your explain solr website to
work. I always get an error saying "Ops. We have internal server error. This
event was logged. We will try fix this soon. We are sorry for
inconvenience."

At this point, I know that I need some technical background to
understand how these numbers are calculated. However, even with that, I am
sure that the format of this output is not obvious. I am curious about the
documentation of this output format. It seems to be unintelligible. 

If this is not documented anywhere, can someone point me to the class that
produces this output?

Thank you,
O. O.


an6 wrote
> Hi,
> 
> to get an idea of the meaning of all this numbers, have a look on 
> http://explain.solr.pl. I like this tool, it's great.
> 
> Uwe






Re: Understanding the Debug explanations for Query Result Scoring/Ranking

Posted by Uwe Reh <re...@hebis.uni-frankfurt.de>.

Hi,

to get an idea of the meaning of all these numbers, have a look at 
http://explain.solr.pl. I like this tool, it's great.

Uwe
