
Posted to java-user@lucene.apache.org by Rifat <ri...@gmail.com> on 2017/06/27 09:45:49 UTC

Is it possible to normalise BM25 scores in the query level?

Hi all,

I have searched for this a lot but could not find a clear answer yet. Is there a
way for Lucene (or Elasticsearch) to provide query-level normalization
of BM25 scores? BM25 scores vary considerably across queries. For
example, is it possible to get scores normalised by the max score for that
query? Since Lucene processes documents one at a time and returns the score for
each document at that moment, the normalisation does not seem easy to do.

thanks,
rifat
 



--
View this message in context: http://lucene.472066.n3.nabble.com/Is-it-possible-to-normalise-BM25-scores-in-the-query-level-tp4342991.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Is it possible to normalise BM25 scores in the query level?

Posted by Rifat <ri...@gmail.com>.

Hi,

How can we normalize BM25 scores by the query length (number of tokens) in
Lucene or Elasticsearch? I can access document fields through scripting or Lucene
expressions in Elasticsearch
(https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-scripting-expression.html),
but I could not find a way to get the number of query tokens inside a script. So
do I have to write a custom Lucene/Elasticsearch query for this? Is there
any other way?

thanks
rifat






RE: Is it possible to normalise BM25 scores in the query level?

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,
> In our use case, we want to perform learning to rank and train a decision
> tree using BM25 scores as one of our features. Decision trees require
> normalised features to be able to split the data properly. Since BM25 scores
> for different queries vary considerably, the decision tree cannot find a
> suitable threshold to split on.

The "old Lucene" query normalization has nothing to do with BM25. That normalization was based on the query only, just to ensure that the numbers stay around 1 (for historical reasons: in the early days of Lucene, huge scores led to rounding problems). It was removed in Lucene 7 together with the TF-IDF based "coordination factors" in boolean queries. In fact this is an improvement, because the normalization scaled the values by a factor that depended on the query, which made scores impossible to compare across queries.
 
> What was the normalisation in Lucene 6? We are using Lucene 6.4.2 but could
> not find any way to normalise BM25 scores other than hacking into the code.

In Lucene 7 the scores are no longer normalized and are much easier to compare between queries of similar structure and across different indexes, but still with no guarantees (of course, comparing queries with different numbers of words or completely different structure is still not easily possible). Plain word-based queries ("match" query in Elasticsearch) should be fine if you add your own normalization on the number of terms in the query (e.g., divide the final score by the number of terms). For LTR purposes this should be fine. I'd try the Lucene 7 master version to check whether this helps for your use case.
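
The per-term normalization suggested above could look roughly like the sketch below. It is only an illustration in plain Java with no Lucene dependency: the class TermCountNormalizer and both method names are hypothetical, the raw scores are assumed to have been copied out of the search hits already, and the whitespace split is a stand-in for whatever analyzer the field really uses (the true term count should come from that analyzer).

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene or Elasticsearch.
class TermCountNormalizer {

    // Counts query terms by splitting on whitespace.
    // Assumption: this approximates the analyzer used at query time.
    public static int countTerms(String queryText) {
        String trimmed = queryText.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Returns a new array with each raw score divided by the term count.
    public static float[] normalizeByTermCount(float[] rawScores, int termCount) {
        if (termCount <= 0) {
            throw new IllegalArgumentException("termCount must be positive");
        }
        float[] normalized = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            normalized[i] = rawScores[i] / termCount;
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Made-up scores for a three-term query, in hit order.
        float[] raw = {7.5f, 4.5f, 1.5f};
        int terms = countTerms("lucene bm25 normalisation");
        System.out.println(Arrays.toString(normalizeByTermCount(raw, terms)));
    }
}
```

Doing the division on the client side, after the search returns, avoids having to pass the token count into a script or write a custom query.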

Uwe


RE: Is it possible to normalise BM25 scores in the query level?

Posted by Rifat <ri...@gmail.com>.

Thanks for your message.

In our use case, we want to perform learning to rank and train a decision
tree using BM25 scores as one of our features. Decision trees require
normalised features to be able to split the data properly. Since BM25 scores
for different queries vary considerably, the decision tree cannot find a
suitable threshold to split on.

What was the normalisation in Lucene 6? We are using Lucene 6.4.2 but could
not find any way to normalise BM25 scores other than hacking into the code.




RE: Is it possible to normalise BM25 scores in the query level?

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

Once you have executed the query, the TopDocs collector gives you the maximum score. Then you just need to normalize on your own.

But keep in mind: this is not always a good idea, because the maximum score (the score of the first document) is not necessarily a useful reference for comparison. E.g., if the first, top-ranking result is a very bad match and there are no better ones, there is no reason to say it's a 100% hit!
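
A minimal sketch of this "normalize on your own" step, in plain Java without any Lucene dependency. The class MaxScoreNormalizer and its method are hypothetical names; the input is assumed to be the raw scores of one result set (for example, copied from the score field of each ScoreDoc), and scores are assumed to be non-negative.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene.
class MaxScoreNormalizer {

    // Divides every score by the largest score in the same result set.
    // Returns all zeros when the set is empty or the maximum is not positive.
    public static float[] normalizeByMax(float[] rawScores) {
        float max = 0f;
        for (float score : rawScores) {
            max = Math.max(max, score);
        }
        float[] normalized = new float[rawScores.length];
        if (max <= 0f) {
            return normalized;
        }
        for (int i = 0; i < rawScores.length; i++) {
            normalized[i] = rawScores[i] / max;
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Made-up scores for one query, in hit order.
        float[] raw = {8.0f, 4.0f, 2.0f};
        System.out.println(Arrays.toString(normalizeByMax(raw)));
    }
}
```

The caveat above still applies: the top hit always maps to 1.0 after this step, even when it is a poor match, so the normalized value says nothing about absolute relevance.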

BTW: Lucene up to version 6 had some internal normalization in place, but this was removed in Lucene 7. The reason is simple: scores are calculated only to be compared within the same result set. They were never meant to be used across different indexes or queries.

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Rifat [mailto:rifatozcan1981@gmail.com]
> Sent: Tuesday, June 27, 2017 11:46 AM
> To: java-user@lucene.apache.org
> Subject: Is it possible to normalise BM25 scores in the query level?
> 
> Hi all,
> 
> I have searched for this a lot but could not find a clear answer yet. Is there a
> way for Lucene (or Elasticsearch) to provide query-level normalization
> of BM25 scores? BM25 scores vary considerably across queries. For
> example, is it possible to get scores normalised by the max score for that
> query? Since Lucene processes documents one at a time and returns the score for
> each document at that moment, the normalisation does not seem easy to do.
> 
> thanks,
> rifat