Posted to dev@lucene.apache.org by Leo Boytsov <sr...@cmu.edu> on 2016/07/04 05:53:40 UTC

potential accuracy degradation due to approximation of document length in BM25 (and other similarities)

Hi everybody,

Some time ago, I had to re-implement some Lucene similarities (in
particular BM25 and the older cosine). I noticed that the re-implemented
version (despite using the same formula) performed better on my data set.
The main difference was that my version *did not approximate* document
length.
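
To make the approximation concrete, here is a rough sketch of what I mean
(not the actual Lucene source; SmallFloat.floatToByte315 is, as far as I can
tell, the one-byte encoder the current similarities rely on):

import org.apache.lucene.util.SmallFloat;

public class NormRoundTrip {
  public static void main(String[] args) {
    for (int len : new int[] {100, 105, 110, 120}) {
      // the similarity stores roughly boost / sqrt(length) in a single byte
      byte norm = SmallFloat.floatToByte315((float) (1.0 / Math.sqrt(len)));
      float decoded = SmallFloat.byte315ToFloat(norm);
      // at search time the length is recovered only approximately
      float approxLen = 1f / (decoded * decoded);
      System.out.println(len + " -> norm byte " + norm
          + " -> approx length " + approxLen);
    }
  }
}

Nearby lengths collapse to the same byte, so different documents can end up
with identical length normalization.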

Recently, I have implemented a modification of the current Lucene BM25 that
doesn't use this approximation either. I compared the existing and the
modified similarities (again on some of my quirky data sets). The results
are as follows:

1) The modified Lucene BM25 similarity is, indeed, a tad slower (3-5% in my
tests).
2) The modified Lucene BM25 is also more accurate.
(I don't see a good reason why memoization and document length approximation
should result in any efficiency gain at all, but this is what seems to happen
on current hardware.)
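
For intuition on why the exact length can matter, here is a tiny
self-contained example of the BM25 term-frequency normalization computed
with an exact length and with a slightly-off approximated one (all numbers
are invented; this is not Lucene code):

public class Bm25LengthEffect {
  // standard BM25 tf normalization
  static float bm25Tf(float tf, float docLen, float avgdl, float k1, float b) {
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgdl));
  }

  public static void main(String[] args) {
    float k1 = 1.2f, b = 0.75f, avgdl = 120f, tf = 3f;
    float exactLen = 105f;    // true field length
    float approxLen = 113.8f; // invented stand-in for a decoded one-byte norm
    System.out.println("exact:  " + bm25Tf(tf, exactLen, avgdl, k1, b));
    System.out.println("approx: " + bm25Tf(tf, approxLen, avgdl, k1, b));
  }
}

Small per-document differences like this can shift rankings, which is what I
believe I am seeing on my data sets.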

If this potential accuracy degradation concerns you, additional experiments
using more standard collections can be done (e.g., some TREC collections).

In any case, the reproducible example (which also links to a more detailed
explanation) is in my repo:
https://github.com/searchivarius/AccurateLuceneBM25

Many thanks!

---
Leo

Re: potential accuracy degradation due to approximation of document length in BM25 (and other similarities)

Posted by Robert Muir <rc...@gmail.com>.
The downsides are documented and known. But I don't think you are
fully documenting the tradeoffs here: by encoding up to a 64-bit long,
you can use up to *8x more memory and disk space* than what Lucene
does by default, and that is per field. So that is a real trap. Maybe
throw an exception there instead if the boost != 1F (just don't
support it), and add a guard for "supermassive" documents, so that at
most 16 bits are ever used instead of 64. The document need not
really be massive; a strange analysis chain (n-grams, etc.) alone can
produce large values here.
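
Something along these lines is what I have in mind (just a sketch, not a
patch; the method name is made up):

public class LengthGuard {
  // cap the stored length so it always fits in 16 bits, and refuse
  // index-time boosts rather than folding them into the length
  static int encodeFieldLength(int fieldLength, float boost) {
    if (boost != 1f) {
      throw new UnsupportedOperationException("index-time boosts not supported");
    }
    // a strange analysis chain (n-grams etc) can produce huge lengths;
    // everything above 65535 gets the same capped value
    return Math.min(fieldLength, 0xFFFF);
  }
}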

I have run comparisons in the past on standard collections to see what
happens with this "feature", and the differences were very small. I think
in practice people do far more damage by sharding their document
collections without using a distributed interchange of IDF, which makes
results from different shards incomparable :)
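
To illustrate with invented numbers: the same term can get very different
IDF values on two shards that never exchange global document frequencies,
so the shards' scores are not on the same scale:

public class ShardedIdf {
  // BM25-style IDF
  static double idf(long docCount, long docFreq) {
    return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  }

  public static void main(String[] args) {
    System.out.println("shard A: " + idf(500_000, 50));      // term rare here
    System.out.println("shard B: " + idf(500_000, 5_000));   // term common here
    System.out.println("global : " + idf(1_000_000, 5_050)); // what both should use
  }
}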

As far as the BM25 tableization goes, that is *not* the justification
for using a one-byte encoding. The one-byte encoding was already there
long ago (to save memory/space) when BM25 was added; that small
optimization just takes advantage of it. The optimization exists just
so that BM25 will have comparable performance to ClassicSimilarity.
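
Roughly, the optimization looks like this (a sketch from memory, not the
actual Lucene code): because a norm is a single byte, the length-dependent
part of the denominator can be precomputed for all 256 possible values once
per weight, and per-document scoring becomes a table lookup.

import org.apache.lucene.util.SmallFloat;

public class Bm25NormCache {
  static float approxLength(byte norm) {
    float f = SmallFloat.byte315ToFloat(norm);
    return 1f / (f * f); // note: a real implementation must special-case norm 0
  }

  // one entry per possible norm byte: k1 * (1 - b + b * len / avgdl)
  static float[] buildCache(float k1, float b, float avgdl) {
    float[] cache = new float[256];
    for (int i = 0; i < 256; i++) {
      cache[i] = k1 * (1 - b + b * approxLength((byte) i) / avgdl);
    }
    return cache;
  }
  // per-document scoring then reduces to something like:
  //   idf * tf * (k1 + 1) / (tf + cache[normByte & 0xFF])
}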

Either way, the document's length can be stored with more accuracy,
without wasting space, especially if you don't use index-time
boosting. But the default encoding supports features like index-time
boosting because Lucene supports all of these features.


