Posted to issues@lucene.apache.org by "Anshum Gupta (Jira)" <ji...@apache.org> on 2019/09/25 15:54:00 UTC

[jira] [Updated] (LUCENE-8984) MoreLikeThis MLT is biased for uncommon fields

     [ https://issues.apache.org/jira/browse/LUCENE-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anshum Gupta updated LUCENE-8984:
---------------------------------
    Fix Version/s: 8.3
                   master (9.0)

> MoreLikeThis MLT is biased for uncommon fields
> ----------------------------------------------
>
>                 Key: LUCENE-8984
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8984
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Andy Hind
>            Assignee: Anshum Gupta
>            Priority: Major
>             Fix For: master (9.0), 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> MLT always computes idf from the total document count of the index, not from the count of documents that contain the specific field.
>  
> To quote Maria Mestre from the discussion on the mailing list - 29/01/19
>  
> {quote}The issue I have is that when retrieving the key scored terms (interestingTerms), the code uses the total number of documents in the index, not the total number of documents with populated “description” field. This is where it’s done in the code: [https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651]
> The effect of this choice is that the “idf” does not vary much, given that numDocs >> number of documents with “description”, so the key terms end up being just the terms with the highest term frequencies.
> It is inconsistent because the MLT-search then uses these extracted key terms and scores all documents using an idf which is computed only on the subset of documents with “description”. So one part of the MLT uses a different numDocs than another part. This sounds like an odd choice, and not expected at all, and I wonder if I’m missing something.
> {quote}
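
The effect described in the quote can be illustrated with a small, self-contained sketch. It does not use Lucene itself. The idf formula below is assumed to follow the classic TF-IDF form, 1 + ln((docCount + 1) / (docFreq + 1)), and the class name, method name, and all document counts are hypothetical values chosen only for illustration.

```java
// Hypothetical illustration, not the MoreLikeThis code: compares idf values
// computed against the whole index with idf values computed against only
// the documents that have the field populated.
class IdfSketch {

    // Assumed classic TF-IDF idf: 1 + ln((docCount + 1) / (docFreq + 1)).
    static double idf(long docFreq, long docCount) {
        return 1.0 + Math.log((docCount + 1) / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000;   // all documents in the index (made-up number)
        long docsWithField = 1_000; // documents with "description" populated (made-up number)

        long rareTermDf = 10;       // docFreq of a rare term in "description"
        long commonTermDf = 500;    // docFreq of a common term in "description"

        // Ratio of rare-term idf to common-term idf under each choice of docCount.
        double globalRatio = idf(rareTermDf, numDocs) / idf(commonTermDf, numDocs);
        double fieldRatio = idf(rareTermDf, docsWithField) / idf(commonTermDf, docsWithField);

        System.out.println("idf ratio using total numDocs:     " + globalRatio);
        System.out.println("idf ratio using docs with field:   " + fieldRatio);
    }
}
```

With these made-up counts, the ratio between the rare-term idf and the common-term idf is noticeably smaller when the total document count is used, so term frequency ends up dominating the selection of interesting terms, as the quote describes.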



--
This message was sent by Atlassian Jira
(v8.3.4#803005)