Posted to dev@lucene.apache.org by "Andrew Buchanan (JIRA)" <ji...@apache.org> on 2014/01/13 04:34:06 UTC
[jira] [Commented] (SOLR-2649) MM ignored in edismax queries with operators

    [ https://issues.apache.org/jira/browse/SOLR-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869267#comment-13869267 ] 

Andrew Buchanan commented on SOLR-2649:
---------------------------------------

I'm taking a look at fixing this one.

I've traced this through the code history and back into the old Solr repository. It looks like it was originally submitted this way by Yonik Seeley as SOLR-1553. Any earlier history that might explain the reasoning would presumably be in Lucid Imagination's source control system (which I don't have access to). The DisMax parser on which it was based simply used the MM values as passed in, as has been noted previously.

Hoss Man refers to this behavior as a bug at https://issues.apache.org/jira/browse/SOLR-1553?focusedCommentId=12871244&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12871244 on the original SOLR-1553.

If you force doMinMatched = true in ExtendedDismaxQParser to disable this logic, everything seems to work the way the earlier comments on this issue expect, with the exception of one test case that fails (TestExtendedDismaxParser.testCJKStructured). This test case was added as part of r1406437 by Robert Muir for SOLR-3589 ("Edismax parser does not honor mm parameter if analyzer splits a token").
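
For anyone following along, here is a rough sketch of the operator check being discussed. It is a simplified model, not the actual parser code: the class name MmGateSketch and the whitespace-based clause splitting are assumptions made for illustration, and the real parser counts operators while tokenizing the query. Only the final boolean expression is taken from the issue description.

```java
/** Hypothetical, simplified model of the operator check; not the actual Solr code. */
final class MmGateSketch {

    /**
     * Returns true when mm processing would stay enabled under the existing rule,
     * i.e. when the query contains no OR, NOT, '+' or '-' operators.
     * Assumption: clauses are separated by whitespace (no phrases or field syntax).
     */
    static boolean doMinMatched(String query) {
        int numOR = 0, numNOT = 0, numPluses = 0, numMinuses = 0;
        for (String clause : query.trim().split("\\s+")) {
            if (clause.equals("OR")) {
                numOR++;
            } else if (clause.equals("NOT")) {
                numNOT++;
            } else if (clause.startsWith("+")) {
                numPluses++;
            } else if (clause.startsWith("-")) {
                numMinuses++;
            }
        }
        // Same expression as quoted in the issue description.
        return (numOR + numNOT + numPluses + numMinuses) == 0;
    }

    public static void main(String[] args) {
        System.out.println(doMinMatched("stocks oil gold"));            // true
        System.out.println(doMinMatched("stocks oil gold -stockings")); // false
    }
}
```

Under this model, adding "-stockings" flips the flag to false, which matches the scenario in the issue description where mm stops being applied.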

The last query in that test case is "大亚湾 OR bogus" with mm=100%, which the test expects to evaluate to "+((((standardtok:大 standardtok:亚 standardtok:湾)~3)) (standardtok:bogus))". The comment for the test from Robert Muir indicates that it should "always apply minShouldMatch to the inner booleanqueries created from whitespace, as these are never structured lucene queries but only come from unstructured text". Looking at that query, though, it seems to me that it should instead evaluate to "+(((((standardtok:大 standardtok:亚 standardtok:湾)~3)) (standardtok:bogus))~2)", which additionally applies the MM to the top-level clauses. I'm certainly not a CJK language expert though, so there may be a subtlety here I'm unaware of.
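
To show where the ~3 and ~2 in those two query strings come from, here is a small sketch of the arithmetic. It is an illustration under assumptions, not Solr's implementation: the class and method names are hypothetical, it handles only plain integer and percentage specs (no conditional specs such as "2<75%"), and it assumes percentages are truncated toward zero.

```java
/** Hypothetical sketch of the mm arithmetic; handles only "N" and "N%" specs. */
final class MinShouldMatchSketch {

    /**
     * Computes how many of the optional clauses must match for a given mm spec.
     * Assumption: percentages truncate toward zero, and negative values mean
     * "all but this many" of the optional clauses.
     */
    static int minShouldMatch(int optionalClauses, String spec) {
        String s = spec.trim();
        int calc;
        if (s.endsWith("%")) {
            int percent = Integer.parseInt(s.substring(0, s.length() - 1));
            calc = (int) (optionalClauses * percent / 100f);
        } else {
            calc = Integer.parseInt(s);
        }
        int result = calc < 0 ? optionalClauses + calc : calc;
        // Clamp to the valid range [0, optionalClauses].
        return Math.max(0, Math.min(optionalClauses, result));
    }

    public static void main(String[] args) {
        // Inner query: three tokens produced by the analyzer, mm=100%.
        System.out.println(minShouldMatch(3, "100%")); // 3
        // Top level: two clauses (the inner query and "bogus"), mm=100%.
        System.out.println(minShouldMatch(2, "100%")); // 2
        // The scenario from the issue description: three terms, mm=50%.
        System.out.println(minShouldMatch(3, "50%"));  // 1
    }
}
```

With mm=100%, the three analyzed tokens give the inner ~3 that the test already expects, and the two top-level clauses would give the ~2 proposed above.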

I can put together a patch with some test cases to make this behave as folks here seem to expect, but I would like to get some clarification from Robert if possible on whether he agrees that the existing test case should change...

> MM ignored in edismax queries with operators
> --------------------------------------------
>
>                 Key: SOLR-2649
>                 URL: https://issues.apache.org/jira/browse/SOLR-2649
>             Project: Solr
>          Issue Type: Bug
>          Components: query parsers
>            Reporter: Magnus Bergmark
>            Priority: Minor
>             Fix For: 4.7
>
>
> Hypothetical scenario:
>   1. User searches for "stocks oil gold" with MM set to "50%"
>   2. User adds "-stockings" to the query: "stocks oil gold -stockings"
>   3. User gets no hits since MM was ignored and all terms were AND-ed together
> The behavior seems to be intentional, although the reason why is never explained:
>   // For correct lucene queries, turn off mm processing if there
>   // were explicit operators (except for AND).
>   boolean doMinMatched = (numOR + numNOT + numPluses + numMinuses) == 0; 
> (lines 232-234 taken from tags/lucene_solr_3_3/solr/src/java/org/apache/solr/search/ExtendedDismaxQParserPlugin.java)
> This makes edismax unsuitable as a replacement for dismax; mm is one of the primary features of dismax.



