Posted to dev@lucene.apache.org by "Alessandro Benedetti (JIRA)" <ji...@apache.org> on 2015/12/31 17:05:49 UTC
[jira] [Created] (LUCENE-6954) More Like This Query Generation
Alessandro Benedetti created LUCENE-6954:
--------------------------------------------
Summary: More Like This Query Generation
Key: LUCENE-6954
URL: https://issues.apache.org/jira/browse/LUCENE-6954
Project: Lucene - Core
Issue Type: Improvement
Components: modules/other
Affects Versions: 5.4
Reporter: Alessandro Benedetti
Currently the query is generated as follows:
org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
1) We extract the terms from the fields of interest and add them to a single map:
Map<String, Int> termFreqMap = new HashMap<>();
(We lose the field -> term relation: we no longer know which field a term came from!)
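To make the loss concrete, here is a minimal, Lucene-free sketch. It assumes that counts for the same term coming from different fields are summed into the one map; the class name FlatTermFreqDemo, the method flatten and the sample tokens are hypothetical and only illustrate the idea, they are not the MoreLikeThis code.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class FlatTermFreqDemo {
    /** Sums term counts over all fields into one map, discarding the field name. */
    static Map<String, Integer> flatten(Map<String, String[]> tokensByField) {
        Map<String, Integer> termFreqMap = new HashMap<>();
        for (String[] tokens : tokensByField.values()) {
            for (String token : tokens) {
                termFreqMap.merge(token, 1, Integer::sum);
            }
        }
        return termFreqMap;
    }

    public static void main(String[] args) {
        // Hypothetical tokens per field.
        Map<String, String[]> doc = new LinkedHashMap<>();
        doc.put("weSell", new String[] {"laptop"});
        doc.put("weDontSell", new String[] {"laptop", "phone"});
        // "laptop" is counted twice, but nothing records which field each count came from.
        System.out.println("laptop -> " + flatten(doc).get("laptop"));
    }
}
```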
org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
2) We build the queue that will contain the query terms. At this point we associate these terms with a field again, but:
...
// go through all the fields and find the largest document frequency
String topField = fieldNames[0];
int docFreq = 0;
for (String fieldName : fieldNames) {
  int freq = ir.docFreq(new Term(fieldName, word));
  topField = (freq > docFreq) ? fieldName : topField;
  docFreq = (freq > docFreq) ? freq : docFreq;
}
...
We identify the topField as the field with the highest document frequency for the term.
Then we build the term query:
queue.add(new ScoreTerm(word, topField, score, idf, docFreq, tf));
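The effect of that selection can be reproduced without Lucene. The sketch below mirrors the loop above, with ir.docFreq replaced by a plain map lookup; the class name TopFieldDemo, the method pickTopField and the document frequencies are hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class TopFieldDemo {
    /** Returns the field with the largest document frequency for one term. */
    static String pickTopField(String[] fieldNames, Map<String, Integer> docFreqByField) {
        String topField = fieldNames[0];
        int docFreq = 0;
        for (String fieldName : fieldNames) {
            // Stand-in for ir.docFreq(new Term(fieldName, word)).
            int freq = docFreqByField.getOrDefault(fieldName, 0);
            if (freq > docFreq) {
                topField = fieldName;
                docFreq = freq;
            }
        }
        return topField;
    }

    public static void main(String[] args) {
        String[] fields = {"weSell", "weDontSell"};
        // Hypothetical index statistics for a single term.
        Map<String, Integer> df = new HashMap<>();
        df.put("weSell", 3);
        df.put("weDontSell", 40);
        // Prints "weDontSell", even if the term was extracted from weSell.
        System.out.println(pickTopField(fields, df));
    }
}
```

With these numbers a term that appeared only in the weSell field of the input document would still be queried against weDontSell.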
This way we lose a lot of precision, because each term ends up tied to a single field regardless of the field it originally came from.
I am not sure why we do that.
I would prefer to keep the relation between terms and fields.
Doing so could considerably improve the quality of the MLT query.
For example, suppose I run MLT on two fields, weSell and weDontSell.
I most likely want to find documents with similar terms in weSell and similar terms in weDontSell, without mixing the two up and losing the semantics of the terms.
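One possible shape for keeping the relation is a map keyed by field first and by term second. This is only a sketch of the data structure, not a proposed patch: the class name PerFieldTermFreq, the method collect and the sample tokens are hypothetical, and the real change would have to work with the Lucene term vectors and analyzers.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class PerFieldTermFreq {
    /** Counts term frequencies separately for each field: field -> (term -> frequency). */
    static Map<String, Map<String, Integer>> collect(Map<String, String[]> tokensByField) {
        Map<String, Map<String, Integer>> perField = new HashMap<>();
        for (Map.Entry<String, String[]> entry : tokensByField.entrySet()) {
            Map<String, Integer> freqs =
                perField.computeIfAbsent(entry.getKey(), k -> new HashMap<>());
            for (String token : entry.getValue()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return perField;
    }

    public static void main(String[] args) {
        // Hypothetical tokens per field.
        Map<String, String[]> doc = new LinkedHashMap<>();
        doc.put("weSell", new String[] {"laptop", "laptop", "phone"});
        doc.put("weDontSell", new String[] {"tablet", "phone"});
        Map<String, Map<String, Integer>> perField = collect(doc);
        // Each term stays attached to the field it was extracted from.
        System.out.println("weSell/laptop = " + perField.get("weSell").get("laptop"));
        System.out.println("weDontSell has laptop? "
            + perField.get("weDontSell").containsKey("laptop"));
    }
}
```

With such a structure, each (field, term) pair could be scored and queried against its own field instead of against the field with the highest document frequency.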