Posted to dev@lucene.apache.org by "Alessandro Benedetti (JIRA)" <ji...@apache.org> on 2015/12/31 17:05:49 UTC

[jira] [Created] (LUCENE-6954) More Like This Query Generation

Alessandro Benedetti created LUCENE-6954:
--------------------------------------------

             Summary: More Like This Query Generation 
                 Key: LUCENE-6954
                 URL: https://issues.apache.org/jira/browse/LUCENE-6954
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/other
    Affects Versions: 5.4
            Reporter: Alessandro Benedetti


Currently the query is generated as follows:
org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
1) We extract the terms from the interesting fields and add them to a single map:
Map<String, Int> termFreqMap = new HashMap<>();
(We lose the field -> term relation here: we no longer know which field a term came from!)
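
To illustrate the point above, here is a minimal sketch that does not use Lucene at all. The class FlatTermMapSketch and its flatten helper are hypothetical names made up for this example; they only mimic the idea of counting terms from several fields into one map keyed by term text.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FlatTermMapSketch {

    // Hypothetical helper (not Lucene code): counts the terms of every field
    // into one shared map keyed by term text only.
    static Map<String, Integer> flatten(Map<String, List<String>> termsByField) {
        Map<String, Integer> termFreqMap = new HashMap<>();
        for (List<String> terms : termsByField.values()) {
            for (String term : terms) {
                termFreqMap.merge(term, 1, Integer::sum);
            }
        }
        return termFreqMap;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new HashMap<>();
        doc.put("weSell", Arrays.asList("laptop", "phone"));
        doc.put("weDontSell", Arrays.asList("laptop", "tablet"));
        // "laptop" ends up with a count of 2, but nothing in the map
        // records that it came from two different fields.
        System.out.println(flatten(doc));
    }
}
```

Once the terms are in the flat map, the originating field cannot be recovered from it.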

org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
2) We build the queue that will contain the query terms. At this point we associate these terms with a field again, but:
...
// go through all the fields and find the largest document frequency
String topField = fieldNames[0];
int docFreq = 0;
for (String fieldName : fieldNames) {
  int freq = ir.docFreq(new Term(fieldName, word));
  topField = (freq > docFreq) ? fieldName : topField;
  docFreq = (freq > docFreq) ? freq : docFreq;
}
...

We pick as topField the field with the highest document frequency for the term.
Then we build the term query with it:

queue.add(new ScoreTerm(word, topField, score, idf, docFreq, tf));
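
As a rough illustration of the effect, the sketch below reproduces the selection rule of the loop quoted above against a stubbed statistics table. The class TopFieldSketch, the "field:term" key format and the numbers are all assumptions for this example; the stub stands in for ir.docFreq(new Term(fieldName, word)).

```java
import java.util.HashMap;
import java.util.Map;

public class TopFieldSketch {

    // Hypothetical stand-in for ir.docFreq(new Term(fieldName, word)):
    // looks up a document frequency in an in-memory table keyed by "field:term".
    static int docFreq(Map<String, Integer> stats, String field, String word) {
        return stats.getOrDefault(field + ":" + word, 0);
    }

    // Same selection rule as the quoted loop: keep the field with the
    // largest document frequency, defaulting to the first field.
    static String topField(Map<String, Integer> stats, String[] fieldNames, String word) {
        String topField = fieldNames[0];
        int docFreq = 0;
        for (String fieldName : fieldNames) {
            int freq = docFreq(stats, fieldName, word);
            if (freq > docFreq) {
                topField = fieldName;
                docFreq = freq;
            }
        }
        return topField;
    }

    public static void main(String[] args) {
        Map<String, Integer> stats = new HashMap<>();
        stats.put("weSell:laptop", 3);       // made-up frequencies
        stats.put("weDontSell:laptop", 40);
        String[] fields = {"weSell", "weDontSell"};
        System.out.println(topField(stats, fields, "laptop"));
    }
}
```

With these made-up numbers, a "laptop" term taken from the input document's weSell field would still be queried against weDontSell, simply because that field has the higher document frequency in the index.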

In this way we lose a lot of precision: a term is always queried against the field where it has the highest document frequency, regardless of the field it was extracted from.
I am not sure why we do that.
I would prefer to keep the relation between terms and fields; this could improve the quality of the MLT query considerably.
For example, if I run MLT on two fields, weSell and weDontSell, it is likely that I want to find documents with similar terms in weSell and similar terms in weDontSell, without mixing the two and losing the semantics of the terms.
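
One possible shape for this, shown only as a sketch and not as a proposed patch: keep one term-frequency map per field and emit one (field, term) pair per entry. The class PerFieldTermsSketch and its methods are hypothetical and do not use Lucene; in MoreLikeThis each pair would presumably become a term query on its own field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerFieldTermsSketch {

    // Hypothetical helper: builds field -> (term -> frequency), so the
    // originating field of every term is preserved.
    static Map<String, Map<String, Integer>> collect(Map<String, List<String>> termsByField) {
        Map<String, Map<String, Integer>> perField = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : termsByField.entrySet()) {
            Map<String, Integer> freqs =
                perField.computeIfAbsent(e.getKey(), k -> new LinkedHashMap<>());
            for (String term : e.getValue()) {
                freqs.merge(term, 1, Integer::sum);
            }
        }
        return perField;
    }

    // Emits one "field:term" clause per entry; each term stays tied to
    // the field it was extracted from.
    static List<String> clauses(Map<String, Map<String, Integer>> perField) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> field : perField.entrySet()) {
            for (String term : field.getValue().keySet()) {
                out.add(field.getKey() + ":" + term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("weSell", Arrays.asList("laptop", "phone"));
        doc.put("weDontSell", Arrays.asList("laptop", "tablet"));
        System.out.println(clauses(collect(doc)));
    }
}
```

Here "laptop" yields two separate clauses, one for weSell and one for weDontSell, instead of a single clause on whichever field has the higher document frequency.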


