Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2015/05/07 22:12:01 UTC
[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533301#comment-14533301 ]
Adrien Grand commented on LUCENE-329:
-------------------------------------
I like the patch. Maybe we should blend the total term frequency (ttf) just like we blend the doc frequency (df), so that it also works with a similarity that uses the ttf instead of the df?
Also, this is the second query (the other one is FuzzyLikeThis) where we need to hack TermContext a bit in order to decouple the computation of statistics from the registration of the term states. I'm wondering if we should improve TermContext to make this easier, e.g.:
{code}
Index: lucene/core/src/java/org/apache/lucene/index/TermContext.java
===================================================================
--- lucene/core/src/java/org/apache/lucene/index/TermContext.java (revision 1678141)
+++ lucene/core/src/java/org/apache/lucene/index/TermContext.java (working copy)
@@ -117,16 +117,31 @@
* should be derived from a {@link IndexReaderContext}'s leaf ord.
*/
public void register(TermState state, final int ord, final int docFreq, final long totalTermFreq) {
+ register(state, ord);
+ accumulateStatistics(docFreq, totalTermFreq);
+ }
+
+ /**
+ * Expert: Registers and associates a {@link TermState} with a leaf ordinal. The
+ * leaf ordinal should be derived from a {@link IndexReaderContext}'s leaf ord.
+ * Unlike {@link #register(TermState, int, int, long)}, this method
+ * does NOT update term statistics.
+ */
+ public void register(TermState state, final int ord) {
assert state != null : "state must not be null";
assert ord >= 0 && ord < states.length;
assert states[ord] == null : "state for ord: " + ord
+ " already registered";
+ states[ord] = state;
+ }
+
+ /** Expert: Accumulate term statistics. */
+ public void accumulateStatistics(final int docFreq, final long totalTermFreq) {
this.docFreq += docFreq;
if (this.totalTermFreq >= 0 && totalTermFreq >= 0)
this.totalTermFreq += totalTermFreq;
else
this.totalTermFreq = -1;
- states[ord] = state;
}
/**
{code}
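To illustrate how the split API might be used, here is a minimal sketch. It relies on a simplified stand-in class (TermContextSketch, a hypothetical name) rather than the real TermContext, uses plain Object in place of TermState, and replaces the asserts with exceptions. The blended values passed to accumulateStatistics are made-up numbers; the actual blending rule is whatever the patch on this issue computes.

```java
/** Simplified stand-in for TermContext (assumption: not the real Lucene class). */
class TermContextSketch {
    private final Object[] states; // one slot per leaf ordinal
    private int docFreq;
    private long totalTermFreq;

    TermContextSketch(int leafCount) {
        this.states = new Object[leafCount];
    }

    /** Registers a state for a leaf ordinal WITHOUT touching the statistics. */
    void register(Object state, int ord) {
        if (state == null) {
            throw new IllegalArgumentException("state must not be null");
        }
        if (ord < 0 || ord >= states.length) {
            throw new IndexOutOfBoundsException("ord out of range: " + ord);
        }
        if (states[ord] != null) {
            throw new IllegalStateException("state for ord: " + ord + " already registered");
        }
        states[ord] = state;
    }

    /** Accumulates statistics; a negative totalTermFreq means "unknown" and is sticky. */
    void accumulateStatistics(int docFreq, long totalTermFreq) {
        this.docFreq += docFreq;
        if (this.totalTermFreq >= 0 && totalTermFreq >= 0) {
            this.totalTermFreq += totalTermFreq;
        } else {
            this.totalTermFreq = -1;
        }
    }

    /** Original combined behavior, now expressed through the two smaller methods. */
    void register(Object state, int ord, int docFreq, long totalTermFreq) {
        register(state, ord);
        accumulateStatistics(docFreq, totalTermFreq);
    }

    int docFreq() { return docFreq; }
    long totalTermFreq() { return totalTermFreq; }
    Object state(int ord) { return states[ord]; }

    public static void main(String[] args) {
        TermContextSketch ctx = new TermContextSketch(2);
        // Register the per-leaf states of an expanded term, ignoring its own statistics.
        ctx.register("state-leaf-0", 0);
        ctx.register("state-leaf-1", 1);
        // Then record the blended statistics once (hypothetical values).
        ctx.accumulateStatistics(120, 450L);
        System.out.println("docFreq=" + ctx.docFreq() + " totalTermFreq=" + ctx.totalTermFreq());
    }
}
```

With this split, a query that wants blended statistics no longer has to register states with dummy values and patch the totals afterwards.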
> Fuzzy query scoring issues
> --------------------------
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 1.2
> Environment: Operating System: All
> Platform: All
> Reporter: Mark Harwood
> Assignee: Mark Harwood
> Priority: Minor
> Fix For: 5.x
>
> Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix,
> fuzzy, etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries
> because of the volume of terms introduced (A match on query Foo~ is 0.1
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common
> forms because of the IDF. When using fuzzy queries, for example, rare
> misspellings typically appear in results before the more common correct spellings.
> I will attach a patch that corrects the issues identified above by
> 1) Overriding Similarity.coord to counteract the downplaying of scores
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the
> basis of scoring all other expanded terms.
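
Point 2 of the description above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name SharedIdfSketch, the terms, and the document frequencies are hypothetical, and the idf function assumes the classic TF-IDF form 1 + ln(numDocs / (docFreq + 1)). It is not the code from the attached patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration: score every expanded term with the IDF of the most common form. */
class SharedIdfSketch {

    /** Classic TF-IDF style idf (assumed formula): rarer terms get a higher value. */
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    /** Returns the largest doc freq among the expanded terms, i.e. the most common form. */
    static int maxDocFreq(Map<String, Integer> docFreqs) {
        int max = 0;
        for (int df : docFreqs.values()) {
            max = Math.max(max, df);
        }
        return max;
    }

    public static void main(String[] args) {
        long numDocs = 1000;
        // Made-up expansion of a fuzzy query: one common spelling, one rare misspelling.
        Map<String, Integer> expanded = new LinkedHashMap<>();
        expanded.put("lucene", 900);
        expanded.put("lucine", 3);

        double sharedIdf = idf(maxDocFreq(expanded), numDocs);
        for (Map.Entry<String, Integer> e : expanded.entrySet()) {
            double ownIdf = idf(e.getValue(), numDocs);
            System.out.println(e.getKey() + ": own idf=" + ownIdf + ", shared idf=" + sharedIdf);
        }
    }
}
```

With per-term IDF the rare misspelling gets the larger weight; with the shared value both forms are weighted equally, so the misspelling no longer outranks the correct spelling on IDF alone.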