Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2017/12/04 18:35:00 UTC

[jira] [Commented] (LUCENE-8015) TestBasicModelIne.testRandomScoring failure

    [ https://issues.apache.org/jira/browse/LUCENE-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277214#comment-16277214 ] 

Adrien Grand commented on LUCENE-8015:
--------------------------------------

I have been looking into the following model G failure.

{noformat}
7.0E-45 = score(DFRSimilarity, doc=0, freq=0.99999994), computed from:
  1.4E-45 = boost
  3.09640771E16 = NormalizationH1, computed from: 
    0.99999994 = tf
    1.61490253E9 = avgFieldLength
    112.0 = len
  9.2892231E16 = BasicModelG, computed from: 
    12.0 = numberOfDocuments
    1.0 = totalTermFreq
  4.8443234E-17 = AfterEffectB, computed from: 
    3.09640771E16 = tfn
    1.0 = totalTermFreq
    1.0 = docFreq

5.6E-45 = score(DFRSimilarity, doc=0, freq=1.0), computed from:
  1.4E-45 = boost
  3.09640792E16 = NormalizationH1, computed from: 
    1.0 = tf
    1.61490253E9 = avgFieldLength
    112.0 = len
  9.289224E16 = BasicModelG, computed from: 
    12.0 = numberOfDocuments
    1.0 = totalTermFreq
  4.844323E-17 = AfterEffectB, computed from: 
    3.09640792E16 = tfn
    1.0 = totalTermFreq
    1.0 = docFreq

DFR GB1
field="field",maxDoc=46519,docCount=12,sumTotalTermFreq=19378830951,sumDocFreq=19378830951
term="term",docFreq=1,totalTermFreq=1
norm=59 (doc length ~ 112)
freq=1.0
NOTE: reproduce with: ant test  -Dtestcase=TestBasicModelG -Dtests.method=testRandomScoring -Dtests.seed=3C22B051C61EEC84 -Dtests.locale=cs-CZ -Dtests.timezone=Atlantic/Madeira -Dtests.asserts=true -Dtests.file.encoding=UTF-8
{noformat}

In short, the scoring formula here looks like {{(A + B * tfn) * (C / (tfn + 1))}}, where A, B and C are constants. This function increases with tfn whenever B > A, which is always the case. The problem is that tfn is so large ({{ulp(tfn) = 4}}) that {{tfn + 1}} always returns {{tfn}} and {{A + B * tfn}} always returns the same as {{B * tfn}}. So when tfn gets high, the formula is effectively {{(B * tfn) * (C / tfn)}}. Mathematically this is a constant, but since we compute the left and right parts independently, each with its own rounding error, the result might decrease when tfn increases about half the time.
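
The precision loss can be checked in isolation with a small sketch. It is not the DFRSimilarity code: the class name, the helper methods and the constants a, b and c are made up for illustration, and it assumes tfn is handled as a double, which is what {{ulp(tfn) = 4}} suggests. Only the tfn value is taken from the explain output above.

```java
public class TfnPrecisionSketch {

    // Same shape as the formula discussed above: (A + B * tfn) * (C / (tfn + 1)).
    // The two halves are rounded independently before being multiplied.
    static double score(double a, double b, double c, double tfn) {
        double left = a + b * tfn;
        double right = c / (tfn + 1);
        return left * right;
    }

    // True when adding 1 to tfn is lost to rounding.
    static boolean absorbsOne(double tfn) {
        return tfn + 1 == tfn;
    }

    // Counts how often the score drops when tfn moves up by one ulp at a time.
    static int countDecreases(double a, double b, double c, double start, int steps) {
        int decreases = 0;
        double tfn = start;
        double previous = score(a, b, c, tfn);
        for (int i = 0; i < steps; i++) {
            tfn = Math.nextUp(tfn);
            double current = score(a, b, c, tfn);
            if (current < previous) {
                decreases++;
            }
            previous = current;
        }
        return decreases;
    }

    public static void main(String[] args) {
        double tfn = 3.09640771E16; // tfn from the explain output above
        System.out.println("ulp(tfn) = " + Math.ulp(tfn));
        System.out.println("tfn + 1 == tfn: " + absorbsOne(tfn));
        // a, b and c are hypothetical constants chosen so that b > a.
        System.out.println("decreases over 1000 ulp steps: "
            + countDecreases(1.0, 3.0, 0.5, tfn, 1000));
    }
}
```

How many decreases get counted depends on the constants, so the sketch prints the count rather than promising a particular value.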

Even though I triggered it with BasicModelG, I suspect it affects almost all DFRSimilarity impls.

> TestBasicModelIne.testRandomScoring failure
> -------------------------------------------
>
>                 Key: LUCENE-8015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8015
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>         Attachments: LUCENE-8015_test_fangs.patch
>
>
> reproduce with: ant test  -Dtestcase=TestBasicModelIne -Dtests.method=testRandomScoring -Dtests.seed=86E85958B1183E93 -Dtests.slow=true -Dtests.locale=vi-VN -Dtests.timezone=Pacific/Tongatapu -Dtests.asserts=true -Dtests.file.encoding=UTF8


