Posted to dev@lucene.apache.org by "Alessandro Benedetti (JIRA)" <ji...@apache.org> on 2018/05/31 14:51:00 UTC

[jira] [Issue Comment Deleted] (LUCENE-6687) MLT term frequency calculation bug

     [ https://issues.apache.org/jira/browse/LUCENE-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alessandro Benedetti updated LUCENE-6687:
-----------------------------------------
    Comment: was deleted

(was: I just checked the source code and found different logic in place, so I assume this bug was fixed a long time ago.
Can anyone close this Jira?)

> MLT term frequency calculation bug
> ----------------------------------
>
>                 Key: LUCENE-6687
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6687
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring, core/queryparser
>    Affects Versions: 5.2.1, 6.0
>         Environment: OS X v10.10.4; Solr 5.2.1
>            Reporter: Marko Bonaci
>            Priority: Major
>             Fix For: 5.2.2
>
>         Attachments: LUCENE-6687.patch, buggy-method-usage.png, solr-mlt-tf-doubling-bug-results.png, solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png, solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png, solr-mlt-tf-doubling-bug.png, terms-accumulator.png, terms-angry.png, terms-glass.png, terms-how.png
>
>
> In {{org.apache.lucene.queries.mlt.MoreLikeThis}}, there's a method {{retrieveTerms}} that receives a {{Map}} of fields, i.e. a document basically, but it doesn't have to be an existing doc.
> !solr-mlt-tf-doubling-bug.png|height=500!
> There are 2 for loops, one inside the other, which both loop through the same set of fields.
> That effectively doubles the term frequency for all terms from the fields that we provide in the MLT QP {{qf}} parameter.
> It basically goes two times over the list of fields and accumulates the term frequencies from all fields into {{termFreqMap}}.
> The private method {{retrieveTerms}} is called from only one public method, the overload of {{like}} that receives a {{Map}}, so the private class member {{fieldNames}} is always derived from the {{fields}} argument of {{retrieveTerms}}.
>  
> Uh, I don't understand what I wrote myself, but that basically means that, by the time the {{retrieveTerms}} method gets called, its parameter {{fields}} and the private member {{fieldNames}} always contain the same list of fields.
> Here's the proof:
> These are the final results of the calculation:
> !solr-mlt-tf-doubling-bug-results.png|height=700!
> And this is the actual {{thread_id:TID0009}} document, where those values were derived from (from fields {{title_mlt}} and {{pagetext_mlt}}):
> !terms-glass.png|height=100!
> !terms-angry.png|height=100!
> !terms-how.png|height=100!
> !terms-accumulator.png|height=100!
> Now, let's further test this hypothesis by seeing MLT QP in action from the AdminUI.
> Let's try to find docs that are More Like doc {{TID0009}}. 
> Here's the interesting part, the query:
> {code}
> q={!mlt qf=pagetext_mlt,title_mlt mintf=14 mindf=2 minwl=3 maxwl=15}TID0009
> {code}
> We just saw, in the last image above, that the term {{accumulator}} appears {{7}} times in the {{TID0009}} doc, but its TF was calculated as {{14}}.
> By using {{mintf=14}}, we say that, when calculating similarity, we don't want to consider terms that appear fewer than 14 times (when terms from fields {{title_mlt}} and {{pagetext_mlt}} are merged together) in {{TID0009}}.
> I added the term {{accumulator}} to only one other document ({{TID0004}}), where it appears only once, in the field {{title_mlt}}.
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png|height=500!
> Let's see what happens when we use {{mintf=15}}:
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png|height=500!
> I should probably mention that multiple fields ({{qf}}) work because I applied the patch: [SOLR-7143|https://issues.apache.org/jira/browse/SOLR-7143].
> Bug, no?
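
The doubling described in the report can be illustrated with a small standalone sketch. This is not the Lucene source: the class MltTfDoublingSketch, its methods, and the way the 7 occurrences of "accumulator" are split between the two fields are all assumptions made for illustration. Only the field names, the total of 7, and the mintf values 14 and 15 come from the report. The sketch assumes a term is kept when its accumulated frequency is at least mintf.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MltTfDoublingSketch {

    // Hypothetical document: field names are taken from the report, but the
    // 1 + 6 split of "accumulator" is invented (only the total of 7 is given).
    static Map<String, List<String>> sampleDoc() {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("title_mlt", new ArrayList<>(Collections.nCopies(1, "accumulator")));
        doc.put("pagetext_mlt", new ArrayList<>(Collections.nCopies(6, "accumulator")));
        return doc;
    }

    // Models the reported pattern: the outer and inner loops walk the same
    // set of fields, so each field's tokens are added once per outer pass.
    static Map<String, Integer> countNested(Map<String, List<String>> fields) {
        Map<String, Integer> termFreq = new LinkedHashMap<>();
        for (String fieldName : fields.keySet()) {      // plays the role of fieldNames
            for (String field : fields.keySet()) {      // plays the role of fields
                for (String token : fields.get(field)) {
                    termFreq.merge(token, 1, Integer::sum);
                }
            }
        }
        return termFreq;
    }

    // Single pass over the fields: each token is counted exactly once.
    static Map<String, Integer> countOnce(Map<String, List<String>> fields) {
        Map<String, Integer> termFreq = new LinkedHashMap<>();
        for (List<String> tokens : fields.values()) {
            for (String token : tokens) {
                termFreq.merge(token, 1, Integer::sum);
            }
        }
        return termFreq;
    }

    // Assumed mintf semantics: keep a term whose frequency is >= minTf.
    static boolean passesMinTf(Map<String, Integer> termFreq, String term, int minTf) {
        return termFreq.getOrDefault(term, 0) >= minTf;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = sampleDoc();
        Map<String, Integer> nested = countNested(doc);
        Map<String, Integer> once = countOnce(doc);
        System.out.println("nested loops TF: " + nested.get("accumulator"));
        System.out.println("single pass TF:  " + once.get("accumulator"));
        System.out.println("nested, mintf=14: " + passesMinTf(nested, "accumulator", 14));
        System.out.println("nested, mintf=15: " + passesMinTf(nested, "accumulator", 15));
        System.out.println("single, mintf=14: " + passesMinTf(once, "accumulator", 14));
    }
}
```

In this simplified model, two fields mean that every field is visited twice, which is why the accumulated frequency comes out as twice the real count and why the term survives mintf=14 but not mintf=15.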



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
