Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2015/06/25 03:13:05 UTC

[jira] [Created] (LUCENE-6603) consider restrictions on what Similarity.coord() can return

Robert Muir created LUCENE-6603:
-----------------------------------

             Summary: consider restrictions on what Similarity.coord() can return
                 Key: LUCENE-6603
                 URL: https://issues.apache.org/jira/browse/LUCENE-6603
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Robert Muir


Today, Similarity.coord() can return really anything. All of our own similarities are well-behaved; the issues are all about custom ones:

{code}
   * @param overlap the number of query terms matched in the document
   * @param maxOverlap the total number of terms in the query
   * @return a score factor based on term overlap with the query
   */
  public float coord(int overlap, int maxOverlap)
{code}

But problems arise when a custom similarity implements {{coord(1,1)}} to return something crazy (say 2). In this case their coord impl is ignored (see LUCENE-4300): {{coord(1,1)}} is always treated as 1, because otherwise things make no sense. For example, A NOT B would score differently from A (which would be a simple TermQuery and have no BQ around it). The same goes for filters, which should not change scoring but would if we didn't mandate this.
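To make the A NOT B case concrete, here is a minimal sketch. It assumes the boolean score is the matched clause's score multiplied by coord, and it uses hypothetical stand-in methods ({{customCoord}}, {{scoreTermQuery}}, and so on), not Lucene's actual classes:

```java
// Hypothetical stand-ins for illustration only; none of these are Lucene APIs.
final class CoordOneOfOneSketch {
    /** Badly behaved custom coord: returns 2 even when the single clause matched. */
    static float customCoord(int overlap, int maxOverlap) {
        return 2f;
    }

    /** Score of the bare term query A: there is no BQ, so coord is never consulted. */
    static float scoreTermQuery(float termScore) {
        return termScore;
    }

    /** Score of "A NOT B" if the BQ naively applied the custom coord(1,1). */
    static float scoreBooleanNaive(float termScore) {
        return termScore * customCoord(1, 1);
    }

    /** Score of "A NOT B" when coord(1,1) is always treated as 1. */
    static float scoreBooleanMandated(float termScore) {
        return termScore * 1f;
    }

    public static void main(String[] args) {
        float a = 0.5f; // assumed score of term A in some document
        System.out.println("A alone:            " + scoreTermQuery(a));
        System.out.println("A NOT B (naive):    " + scoreBooleanNaive(a));
        System.out.println("A NOT B (mandated): " + scoreBooleanMandated(a));
    }
}
```

With the naive variant the prohibited clause changes the score of every matching document, even though it contributes nothing to scoring; the mandated variant matches the bare term query.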

Now we see the same problem again, with LUCENE-6585. For this optimization to work in the current world, it will have to check that {{coord(N,N)}} == 1 before it does anything. Otherwise it cannot safely collapse conjunctions for such custom similarities.
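A rough sketch of why the check is needed, assuming that "collapsing" means rewriting a nested conjunction such as (A AND B) AND C into the flat A AND B AND C, and that a conjunction scores as the sum of its clause scores times {{coord(N,N)}}. The {{CoordFunction}} interface and the helper names are hypothetical, not Lucene APIs:

```java
// Hypothetical sketch; CoordFunction stands in for Similarity.coord().
final class CollapseConjunctionSketch {
    interface CoordFunction {
        float coord(int overlap, int maxOverlap);
    }

    /** Conjunction where every clause matched: sum of clause scores times coord(N,N). */
    static float conjunctionScore(CoordFunction sim, float... clauseScores) {
        float sum = 0f;
        for (float s : clauseScores) {
            sum += s;
        }
        return sum * sim.coord(clauseScores.length, clauseScores.length);
    }

    /** Score of the nested form (A AND B) AND C. */
    static float nestedScore(CoordFunction sim, float a, float b, float c) {
        float inner = conjunctionScore(sim, a, b);
        return conjunctionScore(sim, inner, c);
    }

    /** Score of the collapsed form A AND B AND C. */
    static float flatScore(CoordFunction sim, float a, float b, float c) {
        return conjunctionScore(sim, a, b, c);
    }

    /** The safety check: coord(n,n) must be 1 for every clause count involved. */
    static boolean safeToCollapse(CoordFunction sim, int... clauseCounts) {
        for (int n : clauseCounts) {
            if (sim.coord(n, n) != 1f) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CoordFunction wellBehaved = (overlap, maxOverlap) -> overlap / (float) maxOverlap;
        CoordFunction crazy = (overlap, maxOverlap) -> 2f;

        // Well-behaved coord: nested and flat forms agree, so collapsing is safe.
        System.out.println(nestedScore(wellBehaved, 1f, 1f, 4f) + " vs " + flatScore(wellBehaved, 1f, 1f, 4f));
        // Crazy coord: the two forms disagree, so collapsing would change scores.
        System.out.println(nestedScore(crazy, 1f, 1f, 4f) + " vs " + flatScore(crazy, 1f, 1f, 4f));
        System.out.println(safeToCollapse(crazy, 2, 3));
    }
}
```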

I would like to enforce that {{coord(N,N)}} is always treated as 1, to prevent all these crazy codepaths for weird, not-so-well-tested cases. So we could change the current hack in BooleanWeight.coord():
{code}
--   } else if (maxOverlap == 1) {
++   } else if (overlap == maxOverlap) {
   return 1f;
{code}
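As a self-contained illustration of the difference, the sketch below compares the two conditions side by side. It only models the branch shown in the diff (whatever precedes the {{else if}} is omitted), and {{CoordFunction}}, {{currentCoord}} and {{proposedCoord}} are hypothetical names, not the real BooleanWeight code:

```java
// Hypothetical sketch of the changed branch; not the actual BooleanWeight source.
final class ProposedCoordSketch {
    /** Stand-in for the coord() hook of a custom similarity. */
    interface CoordFunction {
        float coord(int overlap, int maxOverlap);
    }

    /** Current hack: only the single-clause case (maxOverlap == 1) is forced to 1. */
    static float currentCoord(CoordFunction sim, int overlap, int maxOverlap) {
        if (maxOverlap == 1) {
            return 1f;
        }
        return sim.coord(overlap, maxOverlap);
    }

    /** Proposed behavior: every full match coord(N,N) is forced to 1. */
    static float proposedCoord(CoordFunction sim, int overlap, int maxOverlap) {
        if (overlap == maxOverlap) {
            return 1f;
        }
        return sim.coord(overlap, maxOverlap);
    }

    public static void main(String[] args) {
        // Custom coord chosen for illustration: it returns 1/overlap.
        CoordFunction custom = (overlap, maxOverlap) -> 1f / overlap;

        System.out.println(currentCoord(custom, 2, 2));  // custom value is still applied
        System.out.println(proposedCoord(custom, 2, 2)); // full match, forced to 1
        System.out.println(proposedCoord(custom, 2, 3)); // partial match, custom value applied
    }
}
```

Partial matches are untouched by the change; only the full-match case stops consulting the similarity.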

Alternatively though, we could change javadocs of Similarity.coord() and add a check to BooleanWeight to throw an exception.
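The exception-throwing alternative could look roughly like the following. This is only a sketch: the {{CoordFunction}} interface, the {{checkedCoord}} helper, and the choice of {{IllegalStateException}} are all assumptions for illustration, not a decided design:

```java
// Hypothetical sketch of a strict check; names and exception type are assumptions.
final class StrictCoordSketch {
    /** Stand-in for the coord() hook of a custom similarity. */
    interface CoordFunction {
        float coord(int overlap, int maxOverlap);
    }

    /** Calls the similarity, but rejects any coord(N,N) that is not exactly 1. */
    static float checkedCoord(CoordFunction sim, int overlap, int maxOverlap) {
        float value = sim.coord(overlap, maxOverlap);
        if (overlap == maxOverlap && value != 1f) {
            throw new IllegalStateException(
                "coord(" + overlap + "," + maxOverlap + ") must return 1, but got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        CoordFunction wellBehaved = (overlap, maxOverlap) -> overlap / (float) maxOverlap;
        CoordFunction custom = (overlap, maxOverlap) -> 1f / overlap;

        System.out.println(checkedCoord(wellBehaved, 3, 3)); // passes the check
        try {
            checkedCoord(custom, 2, 2); // violates the contract
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Compared with silently forcing 1, this makes the broken custom similarity visible to its author instead of quietly ignoring part of it.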

In either case, this would be a backwards-compatibility break, because it would break some custom sims out there, at least this one (https://mail-archives.apache.org/mod_mbox/lucene-java-user/201208.mbox/%3CF00509B7-8C6B-4496-951B-E89B168D91A2@local.ch%3E). That one returns {{1/overlap}}, which is kinda like averaging the scores; it is what caused us to look into this and open bugs like LUCENE-4297 and LUCENE-4300 and probably others. But perhaps coord() is not the way such things should be done, and there is another way instead that would allow BQ to be simpler and more efficient.



