Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2017/10/31 02:43:00 UTC

[jira] [Updated] (LUCENE-8025) compute avgdl correctly for DOCS_ONLY

     [ https://issues.apache.org/jira/browse/LUCENE-8025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-8025:
--------------------------------
    Attachment: LUCENE-8025.patch

Patch attached. It falls back to the bogus value only if sumDocFreq is unavailable, which doesn't happen with any codec since Lucene 4 or so.
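
A minimal sketch of that fallback, using plain longs rather than Lucene's CollectionStatistics. The class name, the method signature, and the -1 "unavailable" sentinel are assumptions made for illustration; this is not the code in the attached patch.

```java
final class AvgFieldLengthSketch {
    // Assumption: -1 marks a statistic that the codec did not record.
    static final long UNAVAILABLE = -1L;

    /**
     * Average field length (avgdl) = total tokens in the field / number of documents.
     * When frequencies are omitted, every posting counts as one token, so
     * sumDocFreq (the number of postings) stands in for sumTotalTermFreq.
     */
    static float avgFieldLength(long sumTotalTermFreq, long sumDocFreq, long docCount) {
        long totalTokens = sumTotalTermFreq;
        if (totalTokens == UNAVAILABLE) {
            if (sumDocFreq == UNAVAILABLE) {
                // Neither statistic is available: keep the old bogus value.
                return 1f;
            }
            totalTokens = sumDocFreq;
        }
        return (float) (totalTokens / (double) docCount);
    }

    public static void main(String[] args) {
        // Frequencies omitted: 12 postings over 4 docs.
        System.out.println(avgFieldLength(UNAVAILABLE, 12, 4)); // prints 3.0
        // Frequencies indexed: sumTotalTermFreq is used directly.
        System.out.println(avgFieldLength(20, 12, 4));          // prints 5.0
        // Nothing available: the old fallback of 1.
        System.out.println(avgFieldLength(UNAVAILABLE, UNAVAILABLE, 4)); // prints 1.0
    }
}
```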

Note that for SimilarityBase it corrects not just avgdl but also numberOfFieldTokens, which was previously (bogusly) set to docFreq, as if the term being scored were the only one in the collection! I will update tests across more sims such as LM and DFI that are sensitive to this, to see whether there is any improvement.
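
To illustrate how far apart the two values can be, here is a toy calculation over a hypothetical three-document field indexed without frequencies. The class, the helper names, and the sample documents are made up for illustration and do not come from Lucene or the patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

final class FieldTokensSketch {
    /** Hypothetical sample corpus: each inner list is one document's terms. */
    static List<List<String>> sampleDocs() {
        return Arrays.asList(
            Arrays.asList("a", "b"),
            Arrays.asList("a", "c", "d"),
            Arrays.asList("b"));
    }

    /** Number of postings in the field: one per (document, unique term) pair. */
    static long sumDocFreq(List<List<String>> docs) {
        long postings = 0;
        for (List<String> doc : docs) {
            postings += new HashSet<>(doc).size();
        }
        return postings;
    }

    /** Number of documents that contain the given term. */
    static long docFreq(List<List<String>> docs, String term) {
        long count = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<String>> docs = sampleDocs();
        // Old behaviour: field token count taken from the scored term alone.
        System.out.println(docFreq(docs, "a")); // prints 2
        // New behaviour: every posting counts as one token (tf = 1).
        System.out.println(sumDocFreq(docs));   // prints 6
    }
}
```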

> compute avgdl correctly for DOCS_ONLY
> -------------------------------------
>
>                 Key: LUCENE-8025
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8025
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-8025.patch
>
>
> Spinoff of LUCENE-8007:
> If you omit term frequencies, we should score as if all tf values were 1. This is how it worked for e.g. ClassicSimilarity, and you can understand how it degrades.
> However, for sims such as BM25, we currently bail out on computing the average document length (and just return a bogus value of 1), which screws up length normalization too, a separate problem.
> Instead of a bogus value, we should substitute sumDocFreq for sumTotalTermFreq (all postings have a freq of 1, since you omitted frequencies).


