Posted to dev@lucene.apache.org by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org> on 2012/03/04 00:45:57 UTC

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221737#comment-13221737 ] 

Doron Cohen commented on LUCENE-3821:
-------------------------------------

I understand the problem. 

As Robert mentioned, it all comes down to the repeating terms in the phrase query. I am working on a patch that changes the way repeats are handled.

Repeating PPs (PhrasePositions) require additional computation. The current SloppyPhraseScorer attempts to do that additional work efficiently, but in doing so it oversimplifies and fails to cover all cases.

At its core, each time a repeating PP is selected (from the queue) and propagated, *all* its sibling repeaters are propagated as well, to prevent two repeating PPs from pointing to the same document position (the bug that originally triggered the handling of repeats in this code).

But this is wrong, because it propagates all sibling repeaters and so misses some cases.
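
As a rough illustration of the collision that the repeat handling guards against, here is a toy model. It is not the SloppyPhraseScorer code: the ToyPP class, its fields, and the helper names are hypothetical, and it assumes that a sloppy match is accepted when the spread of offset-adjusted positions is within the slop.

```java
final class RepeatingPpSketch {

    /** Toy stand-in for one phrase position; hypothetical, not Lucene's class. */
    static final class ToyPP {
        final String term;
        final int offset;  // index of the term inside the phrase
        final int docPos;  // position of the term occurrence in the document

        ToyPP(String term, int offset, int docPos) {
            this.term = term;
            this.offset = offset;
            this.docPos = docPos;
        }

        /** Position adjusted by the phrase offset (assumed matching convention). */
        int phrasePos() {
            return docPos - offset;
        }
    }

    /** True if two repeats of the same term sit on the same document position. */
    static boolean collide(ToyPP a, ToyPP b) {
        return a.term.equals(b.term) && a.offset != b.offset && a.docPos == b.docPos;
    }

    /** Spread of the offset-adjusted positions of two phrase positions. */
    static int spread(ToyPP a, ToyPP b) {
        return Math.abs(a.phrasePos() - b.phrasePos());
    }

    public static void main(String[] args) {
        // Phrase "a a" with slop 1 against a document with a single "a" at position 0.
        ToyPP first = new ToyPP("a", 0, 0);
        ToyPP second = new ToyPP("a", 1, 0);
        int slop = 1;
        System.out.println("collide: " + collide(first, second));
        System.out.println("spread within slop: " + (spread(first, second) <= slop));
    }
}
```

In this toy model both repeats of "a" use the same occurrence, yet the spread check alone would still pass, which is why a collision has to be detected and resolved by advancing one of the repeats.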

Also, the code is hard to read, as Mike noted in LUCENE-2410 ([this comment|https://issues.apache.org/jira/browse/LUCENE-2410?focusedCommentId=12879443&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12879443]).

So this is a chance to also make the code more maintainable.

I have a working version that is not ready to commit yet. All the tests pass except for one, which I think exposes a bug in ExactPhraseScorer, but maybe I am missing something.

The case that fails is this:

{noformat}
AssertionError: Missing in super-set: doc 706
q1: field:"(j o s) (i b j) (t d)"
q2: field:"(j o s) (i b j) (t d)"~1
td1: [doc=706 score=7.7783184 shardIndex=-1, doc=175 score=6.222655 shardIndex=-1]
td2: [doc=523 score=5.5001016 shardIndex=-1, doc=957 score=5.5001016 shardIndex=-1, doc=228 score=4.400081 shardIndex=-1, doc=357 score=4.400081 shardIndex=-1, doc=390 score=4.400081 shardIndex=-1, doc=503 score=4.400081 shardIndex=-1, doc=602 score=4.400081 shardIndex=-1, doc=757 score=4.400081 shardIndex=-1, doc=758 score=4.400081 shardIndex=-1]
doc 706: Document<stored,indexed,tokenized<field:s o b h j t j z o>>
{noformat}

It seems that q1 should not match this document either?
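
One way to sanity-check this by hand is a small standalone program that needs no Lucene on the classpath. It is only a sketch: it assumes the three term groups of q1 sit at consecutive phrase positions (the query string above does not show positions), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ExactMultiPhraseCheck {

    /**
     * True if some start position p exists such that tokens[p + i]
     * belongs to groups.get(i) for every group index i.
     */
    static boolean matchesExactly(String[] tokens, List<Set<String>> groups) {
        for (int start = 0; start + groups.size() <= tokens.length; start++) {
            boolean allMatch = true;
            for (int i = 0; i < groups.size(); i++) {
                if (!groups.get(i).contains(tokens[start + i])) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return true;
            }
        }
        return false;
    }

    /** Builds one group of alternative terms for a single phrase position. */
    static Set<String> group(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    public static void main(String[] args) {
        // Field content of doc 706 as shown in the failure output.
        String[] doc706 = "s o b h j t j z o".split(" ");

        // q1: (j o s) (i b j) (t d), assumed to be at consecutive positions.
        List<Set<String>> q1 = new ArrayList<>();
        q1.add(group("j", "o", "s"));
        q1.add(group("i", "b", "j"));
        q1.add(group("t", "d"));

        System.out.println("q1 matches doc 706: " + matchesExactly(doc706, q1));
    }
}
```

Under the consecutive-positions assumption the check finds no start position that satisfies all three groups. If the test builds the query with other positions, the result may differ.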
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop it's not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case;
> Jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x
