Posted to solr-user@lucene.apache.org by Tom Burton-West <tb...@umich.edu> on 2012/06/28 18:18:43 UTC

edismax parser ignores mm parameter when tokenizer splits tokens (hyphenated words, WDF splitting etc)

Hello,

My previous e-mail with a CJK example has received no replies.  I verified
that this problem also occurs for English.  For example, in the case of the
word "fire-fly", the ICUTokenizer and the WordDelimiterFilter both split
this into two tokens, "fire" and "fly".

With an edismax query and a minimum-should-match of 2, q={!edismax mm=2}, if
the words are entered separately as [fire fly], the edismax parser honors the
mm parameter and does the equivalent of a Boolean AND query.  However, if
the words are entered as a hyphenated word [fire-fly], the tokenizer splits
it into two tokens, "fire" and "fly", and the edismax parser does the
equivalent of a Boolean OR query.
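
One possible reading of the parsed queries appended below, offered as a
simplified model and not as a description of the actual Solr code: mm seems
to count the whitespace-separated clauses the parser sees before analysis, so
[fire-fly] is one clause even though analysis later yields two tokens.  In
this sketch, toy_analyze and toy_parse are hypothetical names, and the regex
split is only a stand-in for ICUTokenizer / WordDelimiterFilter.

```python
import re


def toy_analyze(chunk):
    """Hypothetical stand-in for the analysis chain: split on non-alphanumerics."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", chunk) if t]


def toy_parse(query, mm, field="ocr"):
    """Toy model: one clause per whitespace-separated chunk; mm applies to clauses."""
    clauses = []
    for chunk in query.split():
        # Tokens produced from a single chunk end up as optional terms
        # inside that one clause, so mm never sees them individually.
        terms = " ".join(f"{field}:{t}" for t in toy_analyze(chunk))
        clauses.append(f"({terms})")
    if len(clauses) == 1:
        # Only one top-level clause, so there is nothing for mm to count.
        return "+" + clauses[0]
    return "+(" + " ".join(clauses) + f")~{min(mm, len(clauses))}"


print(toy_parse("fire-fly", 2))  # +(ocr:fire ocr:fly)
print(toy_parse("fire fly", 2))  # +((ocr:fire) (ocr:fly))~2
```

The two printed strings have the same shape as the parsedquery_toString
values in the debug output: no ~2 in the hyphenated case, ~2 when the words
are entered separately.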

I'm not sure I understand the output of the debugQuery, but judging by the
number of hits returned it appears that edismax is not honoring the mm
parameter. Am I missing something, or is this a bug?

I'd like to file a JIRA issue, but want to find out if I am missing
something here.

Details of several queries are appended below.

Tom Burton-West

edismax query with mm=2, entered as a hyphenated word [fire-fly]:

<lst name="debug">
<str name="rawquerystring">{!edismax mm=2}fire-fly</str>
<str name="querystring">{!edismax mm=2}fire-fly</str>
<str name="parsedquery">+DisjunctionMaxQuery(((ocr:fire ocr:fly)))</str>
<str name="parsedquery_toString">+((ocr:fire ocr:fly))</str>


edismax query with mm=2, entered as separate words [fire fly]:  numFound="184962"
<lst name="debug">
<str name="rawquerystring">{!edismax mm=2}fire fly</str>
<str name="querystring">{!edismax mm=2}fire fly</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((ocr:fire)) DisjunctionMaxQuery((ocr:fly)))~2)
</str>


Regular Boolean AND query [fire AND fly]:  numFound="184962"
<str name="rawquerystring">fire AND fly</str>
<str name="querystring">fire AND fly</str>
<str name="parsedquery">+ocr:fire +ocr:fly</str>
<str name="parsedquery_toString">+ocr:fire +ocr:fly</str>

Regular Boolean OR query [fire OR fly]:  numFound="366047"
<lst name="debug">
<str name="rawquerystring">fire OR fly</str>
<str name="querystring">fire OR fly</str>
<str name="parsedquery">ocr:fire ocr:fly</str>
<str name="parsedquery_toString">ocr:fire ocr:fly</str>
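
For anyone who wants to rerun the comparison, here is a minimal sketch that
builds the four request URLs with debugQuery turned on.  The base URL is a
placeholder assumption (substitute your own host and core), build_url is a
hypothetical helper, and rows=0 is used only because the hit counts and debug
section are what matter here.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint, not taken from the original setup.
BASE_URL = "http://localhost:8983/solr/select"

# The four queries compared in this message.
QUERIES = {
    "edismax mm=2, hyphenated": "{!edismax mm=2}fire-fly",
    "edismax mm=2, separate words": "{!edismax mm=2}fire fly",
    "Boolean AND": "fire AND fly",
    "Boolean OR": "fire OR fly",
}


def build_url(q):
    """Return a select URL for q with debug output and no document rows."""
    params = {"q": q, "rows": 0, "debugQuery": "true"}
    return BASE_URL + "?" + urlencode(params)


for label, q in QUERIES.items():
    print(label)
    print("  " + build_url(q))
```

Fetching each URL and comparing numFound and parsedquery across the four
responses should show whether the hyphenated form behaves like the AND query
or the OR query.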

Re: edismax parser ignores mm parameter when tokenizer splits tokens (hyphenated words, WDF splitting etc)

Posted by Tom Burton-West <tb...@umich.edu>.
I opened a JIRA issue, https://issues.apache.org/jira/browse/SOLR-3589, which
also lists a couple of other related mailing list posts.



