You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2015/05/07 22:12:01 UTC

[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues

    [ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533301#comment-14533301 ] 

Adrien Grand commented on LUCENE-329:
-------------------------------------

I like the patch. Maybe we should blend the total term freq just like we blend the doc freq so that it works too with a similarity that uses the ttf instead of the df?

Also it is the 2nd query (the other one is FuzzyLikeThis) where we need to hack a bit TermContext in order to decouple the computation of statistics from the registration of the term states. I'm wondering if we should improve TermContext to make it easier, eg.

{code}
Index: lucene/core/src/java/org/apache/lucene/index/TermContext.java
===================================================================
--- lucene/core/src/java/org/apache/lucene/index/TermContext.java	(revision 1678141)
+++ lucene/core/src/java/org/apache/lucene/index/TermContext.java	(working copy)
@@ -117,16 +117,31 @@
    * should be derived from a {@link IndexReaderContext}'s leaf ord.
    */
   public void register(TermState state, final int ord, final int docFreq, final long totalTermFreq) {
+    register(state, ord);
+    accumulateStatistics(docFreq, totalTermFreq);
+  }
+
+  /**
+   * Expert: Registers and associates a {@link TermState} with an leaf ordinal. The
+   * leaf ordinal should be derived from a {@link IndexReaderContext}'s leaf ord.
+   * On the contrary to {@link #register(TermState, int, int, long)} this method
+   * does NOT update term statistics.
+   */
+  public void register(TermState state, final int ord) {
     assert state != null : "state must not be null";
     assert ord >= 0 && ord < states.length;
     assert states[ord] == null : "state for ord: " + ord
         + " already registered";
+    states[ord] = state;
+  }
+
+  /** Expert: Accumulate term statistics. */
+  public void accumulateStatistics(final int docFreq, final long totalTermFreq) {
     this.docFreq += docFreq;
     if (this.totalTermFreq >= 0 && totalTermFreq >= 0)
       this.totalTermFreq += totalTermFreq;
     else
       this.totalTermFreq = -1;
-    states[ord] = state;
   }
 
   /**
{code}

> Fuzzy query scoring issues
> --------------------------
>
>                 Key: LUCENE-329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-329
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 1.2
>         Environment: Operating System: All
> Platform: All
>            Reporter: Mark Harwood
>            Assignee: Mark Harwood
>            Priority: Minor
>             Fix For: 5.x
>
>         Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy etc)currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries 
> because of the volume of terms introduced (A match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common 
> forms because of the IDF. When using Fuzzy queries for example, rare mis-
> spellings typically appear in results before the more common correct spellings.
> I will attach a patch that corrects the issues identified above by 
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org