Posted to java-user@lucene.apache.org by Hassan Saneifar <ha...@lirmm.fr> on 2010/12/01 13:56:27 UTC

Hit score and confidence ratio in results

Hello,
I'm using Lucene to retrieve the relevant segments of a corpus for a
given query. Every segment is indexed as a Document.
Once the relevant segments are retrieved, I search them for a regex
to capture the requested information. It can happen that the regex
matches in two or more of the documents retrieved by Lucene, so I would
like to attach a confidence score to each captured match. Put simply,
if the regex is found in the first retrieved document and in the
fourth, the match from the first document is more likely to be
relevant, since its document was scored higher than the other one.

To show a confidence score, I tried to use the scores returned by Lucene.
But the Lucene scores are arbitrary and not normalized, so they are not
suitable for this. I wonder how we can get an arbitrary score (>1) when
the default scoring system in Lucene is based on the cosine measure. In
the simple case, the cosine score between two document vectors (obtained
by tf-idf) is between 0 and 1.

Since the Lucene score is arbitrary and not normalized, if I want to
check which query is more relevant for retrieving a document, how can I
compare the scores of the top-ranked documents returned for each
query? Is there any way to give each retrieved document a confidence
score that makes it possible to compare the results of different
searches using different queries?

Many thanks,
Best regards.

-- 
Hassan Saneifar, PhD. student,
University Montpellier II
LIRMM laboratory
161 rue Ada 34392 Montpellier, France
Tel: +33 (0)467 41 85 41
Fax: +33 (0)467 41 85 00
URL: http://www.lirmm.fr/~saneifar

--
Hassan Saneifar, R&D engineer
Satin IP Technologies
Tel: +33 (0)467 13 00 86
URL: http://www.satin-ip.com

Re: Hit score and confidence ratio in results

Posted by Erick Erickson <er...@gmail.com>.
If I understand your problem space, you're trying to compare scores
across two different queries. Don't do this; it is meaningless. Scores
are only valid for comparing documents' relevance within a single
query. How would one compare scores between two queries like
text:nonsense
name:terrible
? Different fields, different values. What does it mean that the
first doc matching the first query has a higher or lower score than
the first doc matching the second query?
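
Within a single query, though, the ratio of each hit's score to the
top hit's score is at least a consistent relative measure. A minimal
Java sketch of that idea, assuming the raw scores of one search have
already been copied into an array. RelativeConfidence and
relativeToTop are hypothetical names, not Lucene API, and the ratio
is a ranking aid rather than a probability, so it still says nothing
across queries.

```java
// Hypothetical helper, not part of Lucene: turns the raw scores of ONE
// query's hits into ratios relative to the best hit of that same query.
final class RelativeConfidence {

    // scores: raw hit scores from a single search, in any order.
    // Returns score / maxScore for each hit, so the best hit gets 1.0.
    static double[] relativeToTop(float[] scores) {
        double max = 0.0;
        for (float s : scores) {
            max = Math.max(max, s);
        }
        double[] ratios = new double[scores.length];
        if (max <= 0.0) {
            return ratios; // no positive score: every ratio stays 0.0
        }
        for (int i = 0; i < scores.length; i++) {
            ratios[i] = scores[i] / max;
        }
        return ratios;
    }
}
```

For the case in the question, a regex match found in hit 1 and in
hit 4 of the same search would get the ratios at index 0 and index 3
of the returned array.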

As for how scores are > 1, have you looked at the scoring
formula here?
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
and here?
http://lucene.apache.org/java/2_4_0/scoring.html#Understanding the Scoring Formula

Don't forget that there can also be boosts...
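
To see how a score can end up above 1 even without boosts, here is a
rough Java sketch of a single-term score. It assumes the default
factors are tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)),
lengthNorm = 1 / sqrt(numTerms) and
queryNorm = 1 / sqrt(sumOfSquaredWeights). ToyScore is a hypothetical
class, all boosts are assumed to be 1, and the lossy one-byte
encoding of norms in the index is ignored.

```java
// Hypothetical illustration only; it does not call Lucene.
final class ToyScore {

    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Score of one document for a one-term query, all boosts assumed 1.
    static double singleTermScore(int freq, int docFreq, int numDocs, int numTerms) {
        double idf = idf(docFreq, numDocs);
        double sumOfSquaredWeights = idf * idf;       // only one term in the query
        double queryNorm = 1.0 / Math.sqrt(sumOfSquaredWeights);
        double coord = 1.0;                           // the single query term matched
        return coord * queryNorm * tf(freq) * idf * idf * lengthNorm(numTerms);
    }
}
```

Calling ToyScore.singleTermScore(9, 9, 1000, 4), i.e. a term that
occurs 9 times in a 4-term field and appears in 9 of 1000 documents,
gives a value well above 1: tf and idf are not bounded by 1 the way a
cosine is.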

Best
Erick
