Posted to solr-user@lucene.apache.org by Naomi Dushay <nd...@stanford.edu> on 2013/09/04 02:54:50 UTC

mm, tie, qs, ps and CJKBigramFilter and edismax and dismax

When I have a field using CJKBigramFilter, CJK queries produce a different parsedquery than non-CJK queries.

  (旧小说 is 3 chars, so 2 bigrams)

args sent in:       q={!qf=bi_fld}旧小说&pf=&pf2=&pf3=

 debugQuery
   <str name="rawquerystring">{!qf=bi_fld}旧小说</str>
   <str name="querystring">{!qf=bi_fld}旧小说</str>
   <str name="parsedquery">(+DisjunctionMaxQuery((((bi_fld:旧小 bi_fld:小说)~2))~0.01) ())/no_coord</str>
   <str name="parsedquery_toString">+(((bi_fld:旧小 bi_fld:小说)~2))~0.01 ()</str>


If I use a non-CJK query string with the same field:

args sent in:      q={!qf=bi_fld}foo bar&pf=&pf2=&pf3=

debugQuery:
   <str name="rawquerystring">{!qf=bi_fld}foo bar</str>
   <str name="querystring">{!qf=bi_fld}foo bar</str>
   <str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:foo)~0.01) DisjunctionMaxQuery((bi_fld:bar)~0.01))~2))/no_coord</str>
   <str name="parsedquery_toString">+(((bi_fld:foo)~0.01 (bi_fld:bar)~0.01)~2)</str>


Why are the parsedquery_toString formulas different? And is there any difference in the actual relevancy formula?

How can you tell the difference between MinNrShouldMatch and a qs, ps, or tie value if they are all represented as ~n in the parsedquery string?
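
One way to read the ~n suffixes in the output above is to look at what precedes the tilde and whether the number has a decimal point. The sketch below assumes the Lucene 4.x toString conventions: a phrase query appends ~slop after its closing quote, a boolean query appends an integer ~minShouldMatch after its closing parenthesis, and a DisjunctionMaxQuery appends a float ~tie after its closing parenthesis. The function name is hypothetical, and the regex ignores cases such as fuzzy terms or a tie printed in exponent notation.

```python
import re

# A tilde that follows a closing quote or closing parenthesis, then a number.
TILDE = re.compile(r'(["\)])~(\d+(?:\.\d+)?)')

def classify_tildes(parsed):
    """Label each ~n in a parsedquery_toString value (hypothetical helper)."""
    labels = []
    for match in TILDE.finditer(parsed):
        before, number = match.group(1), match.group(2)
        if before == '"':
            kind = "slop"   # phrase slop: qs for explicit phrases, ps for pf boosts
        elif "." in number:
            kind = "tie"    # DisjunctionMaxQuery tie breaker, printed as a float
        else:
            kind = "mm"     # BooleanQuery minimum-should-match, printed as an int
        labels.append((number, kind))
    return labels

if __name__ == "__main__":
    example = '+(((bi_fld:"a b"~5)~0.01 (bi_fld:c)~0.01 (bi_fld:d)~0.01)~3) (bi_fld:"c d"~4)~0.01'
    for number, kind in classify_tildes(example):
        print(number, kind)
```

Whether a given slop came from qs or ps cannot be read from the number alone; it depends on whether the phrase sits in the required main clause (qs) or in the optional boost clause (ps).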


To try to get a handle on qs, ps, tie and mm:

 args:  q={!qf=bi_fld pf=bi_fld}"a b" c d&qs=5&ps=4

debugQuery:
  <str name="rawquerystring">{!qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="querystring">{!qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:"a b"~5)~0.01) DisjunctionMaxQuery((bi_fld:c)~0.01) DisjunctionMaxQuery((bi_fld:d)~0.01))~3) DisjunctionMaxQuery((bi_fld:"c d"~4)~0.01))/no_coord</str>
  <str name="parsedquery_toString">+(((bi_fld:"a b"~5)~0.01 (bi_fld:c)~0.01 (bi_fld:d)~0.01)~3) (bi_fld:"c d"~4)~0.01</str>


I get that qs, the query slop, is for explicit phrases in the query, so "a b"~5 makes sense. I also get that ps is for boosting of phrases, so I get (bi_fld:"c d"~4) ... but where is (bi_fld:"a b c d"~4)?


Using dismax (instead of edismax):

args:   q={!dismax  qf=bi_fld pf=bi_fld}"a b" c d&qs=5&ps=4

debugQuery:
  <str name="rawquerystring">{!dismax qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="querystring">{!dismax qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:"a b"~5)~0.01) DisjunctionMaxQuery((bi_fld:c)~0.01) DisjunctionMaxQuery((bi_fld:d)~0.01))~3) DisjunctionMaxQuery((bi_fld:"a b c d"~4)~0.01))/no_coord</str>
  <str name="parsedquery_toString">+(((bi_fld:"a b"~5)~0.01 (bi_fld:c)~0.01 (bi_fld:d)~0.01)~3) (bi_fld:"a b c d"~4)~0.01</str>


So is this an edismax bug?



FYI, I am running Solr 4.4. I have fields defined like so:
<fieldtype name="text_cjk_bi" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" />
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="false" />
  </analyzer>
</fieldtype>

The request handler uses edismax:

<requestHandler name="search" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="q.alt">*:*</str>
<str name="mm">6<-1 6<90%</str>
<int name="qs">1</int>
<int name="ps">0</int>

Re: mm, tie, qs, ps and CJKBigramFilter and edismax and dismax

Posted by Jack Krupansky <ja...@basetechnology.com>.

The query parser sees "q=foo bar" as two separate source query terms and 
analyzes each separately, but "q=旧小说" is seen by the query parser as a 
single source query term and then that one source query term gets tokenized 
by the query term analyzer as two CJK bigrams.

Try "q=foo-bar" and you should then get a query structure comparable to the 
one generated for the CJK query.

-- Jack Krupansky
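
The explanation above can be illustrated with a toy model. This is only a sketch, not Solr's parser or the ICU analysis chain: the function names are hypothetical, the "analyzer" is a crude stand-in that bigrams runs of Han characters and splits other text on non-alphanumerics, and it keeps a lone Han character as a unigram.

```python
import re

def analyze(chunk):
    """Toy stand-in for the field analyzer: bigram Han runs, split the rest."""
    tokens = []
    for run in re.findall(r"[\u4e00-\u9fff]+|[A-Za-z0-9]+", chunk):
        if "\u4e00" <= run[0] <= "\u9fff":
            if len(run) == 1:
                tokens.append(run)  # a lone Han character stays a unigram
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            tokens.append(run.lower())
    return tokens

def parse_clauses(query):
    """The parser splits on whitespace first; each chunk is analyzed on its own."""
    return [analyze(chunk) for chunk in query.split()]

if __name__ == "__main__":
    print(parse_clauses("foo bar"))   # two source terms, one token each
    print(parse_clauses("旧小说"))     # one source term, two bigram tokens
    print(parse_clauses("foo-bar"))   # one source term, two tokens
```

In this model "foo bar" yields two clauses with one token each, while "旧小说" and "foo-bar" each yield a single clause holding two tokens, which is why the latter two end up with the same nesting in the parsed query.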


Re: mm, tie, qs, ps and CJKBigramFilter and edismax and dismax

Posted by Naomi Dushay <nd...@stanford.edu>.

Regarding the edismax relevancy differences I noted in my original message, there are already some issues filed.

Pertaining to the difference in how the phrase queries are merged into the main query, see Michael Dodsworth's comment of 25/Sep/12 on this issue (the ticket is closed, but this problem is not addressed):
  https://issues.apache.org/jira/browse/SOLR-2058

Pertaining to skipping terms in phrase boosting when part of the query is a phrase:
  https://issues.apache.org/jira/browse/SOLR-4130

- Naomi

