Posted to solr-user@lucene.apache.org by Naomi Dushay <nd...@stanford.edu> on 2013/09/04 02:54:50 UTC
mm, tie, qs, ps and CJKBigramFilter and edismax and dismax
When I have a field using CJKBigramFilter, parsed CJK chars have a different parsedQuery than non-CJK queries.
(旧小说 is 3 chars, so 2 bigrams)
args sent in: q={!qf=bi_fld}旧小说&pf=&pf2=&pf3=
debugQuery
<str name="rawquerystring">{!qf=bi_fld}旧小说</str>
<str name="querystring">{!qf=bi_fld}旧小说</str>
<str name="parsedquery">(+DisjunctionMaxQuery((((bi_fld:旧小 bi_fld:小说)~2))~0.01) ())/no_coord</str>
<str name="parsedquery_toString">+(((bi_fld:旧小 bi_fld:小说)~2))~0.01 ()</str>
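For reference, the bigram count mentioned above (3 characters, so 2 bigrams) can be reproduced with a minimal Python sketch. It only imitates the overlapping-pair output seen in the debug query; `cjk_bigrams` is a hypothetical helper, not part of Solr or Lucene, and the handling of a lone character is an assumption about how the filter behaves with outputUnigrams="false".

```python
def cjk_bigrams(text):
    """Return overlapping two-character tokens for a run of CJK characters."""
    if len(text) < 2:
        # Assumption: a lone character cannot form a bigram, so it is kept as-is.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(cjk_bigrams("旧小说"))  # → ['旧小', '小说']
```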
If I use a non-CJK query string with the same field:
args sent in: q={!qf=bi_fld}foo bar&pf=&pf2=&pf3=
debugQuery:
<str name="rawquerystring">{!qf=bi_fld}foo bar</str>
<str name="querystring">{!qf=bi_fld}foo bar</str>
<str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:foo)~0.01) DisjunctionMaxQuery((bi_fld:bar)~0.01))~2))/no_coord</str>
<str name="parsedquery_toString">+(((bi_fld:foo)~0.01 (bi_fld:bar)~0.01)~2)</str>
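The two requests above could be rebuilt for side-by-side comparison with something like the following sketch. The base URL is purely hypothetical (host, port and path depend on the installation), and `debug_url` is an illustrative helper; only the parameter names (q, pf, pf2, pf3, debugQuery) come from the messages in this thread.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host, port and core for your installation.
SOLR_SELECT = "http://localhost:8983/solr/select"


def debug_url(q, **extra):
    """Build a select URL with debugQuery on and the phrase-boost fields emptied."""
    params = {"q": q, "debugQuery": "true", "pf": "", "pf2": "", "pf3": ""}
    params.update(extra)
    # urlencode percent-encodes the local-params braces and the CJK characters.
    return SOLR_SELECT + "?" + urlencode(params)


print(debug_url("{!qf=bi_fld}旧小说"))
print(debug_url("{!qf=bi_fld}foo bar"))
```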
Why do the two parsedquery_toString values have different structures? And does that make any difference to the actual relevancy formula?
How can you tell the difference between the MinNrShouldMatch and a qs or ps or tie value, if they are all represented as ~n in the parsedQuery string?
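One way to read the ~n markers is by what precedes them and whether the number has a decimal point. In Lucene's toString output a ~n right after a quoted phrase is that phrase's slop (qs in the main clause, ps in the boost clause), an integer ~n after a closing parenthesis is the minimum-should-match of a boolean group, and a decimal ~n after a closing parenthesis is the dismax tie. The sketch below applies that heuristic; `classify_tildes` is a hypothetical helper and the rules are a reading of the debug output in this thread, not an official grammar.

```python
import re

# Capture the character before each "~" together with the number after it.
TILDE = re.compile(r'(.)~(\d+(?:\.\d+)?)')


def classify_tildes(parsed):
    """Label every ~n in a parsedquery_toString value (heuristic)."""
    labels = []
    for match in TILDE.finditer(parsed):
        before, number = match.group(1), match.group(2)
        if before == '"':
            kind = "phrase slop"            # qs or ps, depending on the clause
        elif before == ")" and "." in number:
            kind = "dismax tie"             # float after a dismax group
        elif before == ")":
            kind = "minimum should match"   # integer after a boolean group
        else:
            kind = "other"                  # e.g. a fuzzy term
        labels.append((number, kind))
    return labels


example = '+(((bi_fld:"a b"~5)~0.01 (bi_fld:c)~0.01)~2) (bi_fld:"c d"~4)~0.01'
for number, kind in classify_tildes(example):
    print(number, kind)
```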
To try to get a handle on qs, ps, tie and mm:
args: q={!qf=bi_fld pf=bi_fld}"a b" c d&qs=5&ps=4
debugQuery:
<str name="rawquerystring">{!qf=bi_fld pf=bi_fld}"a b" c d</str>
<str name="querystring">{!qf=bi_fld pf=bi_fld}"a b" c d</str>
<str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:"a b"~5)~0.01) DisjunctionMaxQuery((bi_fld:c)~0.01) DisjunctionMaxQuery((bi_fld:d)~0.01))~3) DisjunctionMaxQuery((bi_fld:"c d"~4)~0.01))/no_coord</str>
<str name="parsedquery_toString">+(((bi_fld:"a b"~5)~0.01 (bi_fld:c)~0.01 (bi_fld:d)~0.01)~3) (bi_fld:"c d"~4)~0.01</str>
I get that qs, the query slop, is for explicit phrases in the query, so "a b"~5 makes sense. I also get that ps is for boosting of phrases, so I get (bi_fld:"c d"~4) … but where is (bi_fld:"a b c d"~4)?
Using dismax (instead of edismax):
args: q={!dismax qf=bi_fld pf=bi_fld}"a b" c d&qs=5&ps=4
debugQuery:
<str name="rawquerystring">{!dismax qf=bi_fld pf=bi_fld}"a b" c d</str>
<str name="querystring">{!dismax qf=bi_fld pf=bi_fld}"a b" c d</str>
<str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:"a b"~5)~0.01) DisjunctionMaxQuery((bi_fld:c)~0.01) DisjunctionMaxQuery((bi_fld:d)~0.01))~3) DisjunctionMaxQuery((bi_fld:"a b c d"~4)~0.01))/no_coord</str>
<str name="parsedquery_toString">+(((bi_fld:"a b"~5)~0.01 (bi_fld:c)~0.01 (bi_fld:d)~0.01)~3) (bi_fld:"a b c d"~4)~0.01</str>
So is this an edismax bug?
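The only difference between the two parsed queries is the text of the phrase-boost clause. The sketch below reproduces just that observed difference for the query in this message. Both functions are hypothetical models of what the debug output shows on Solr 4.4, not the parsers' actual code.

```python
import re

# A double-quoted phrase in the user's query string.
QUOTED = re.compile(r'"[^"]*"')


def dismax_pf_phrase(user_query):
    """Observed dismax output: quotes are dropped and every term is kept."""
    return " ".join(user_query.replace('"', " ").split())


def edismax_pf_phrase(user_query):
    """Observed edismax output: explicitly quoted phrases are skipped."""
    return " ".join(QUOTED.sub(" ", user_query).split())


query = '"a b" c d'
print(dismax_pf_phrase(query))   # → a b c d
print(edismax_pf_phrase(query))  # → c d
```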
FYI, I am running Solr 4.4. I have fields defined like so:
<fieldtype name="text_cjk_bi" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" />
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="false" />
  </analyzer>
</fieldtype>
The request handler uses edismax:
<requestHandler name="search" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="q.alt">*:*</str>
    <str name="mm">6&lt;-1 6&lt;90%</str>
    <int name="qs">1</int>
    <int name="ps">0</int>
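The mm setting in that handler explains the ~2 and ~3 in the debug output above: with six or fewer optional clauses, every clause is required. The following sketch evaluates an mm spec under the documented semantics as I understand them (conditional "n<value" entries, negative values subtracting from the clause count, percentages rounded toward zero). `min_should_match` is a hypothetical helper and edge cases may differ from Solr's own calculation.

```python
def min_should_match(optional_clauses, spec):
    """Evaluate an mm spec such as "6<-1 6<90%" for a given clause count."""
    result = optional_clauses
    spec = spec.strip()
    if "<" in spec:
        # Conditional entries: each applies only above its threshold.
        for condition in spec.split():
            bound, value = condition.split("<", 1)
            if optional_clauses <= int(bound):
                return result
            result = min_should_match(optional_clauses, value)
        return result
    if spec.endswith("%"):
        # Percentage of the clause count, truncated toward zero.
        calc = int(optional_clauses * int(spec[:-1]) / 100)
    else:
        calc = int(spec)
    # Negative values mean "all but this many".
    result = optional_clauses + calc if calc < 0 else calc
    return max(0, min(optional_clauses, result))


for n in (2, 3, 7, 10):
    print(n, min_should_match(n, "6<-1 6<90%"))
```

In this model both conditions share the threshold 6, so for longer queries the later 90% entry is the one that decides the result.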
Re: mm, tie, qs, ps and CJKBigramFilter and edismax and dismax
Posted by Jack Krupansky <ja...@basetechnology.com>.
The query parser sees "q=foo bar" as two separate source query terms and
analyzes each separately, but "q=旧小说" is seen by the query parser as a
single source query term and then that one source query term gets tokenized
by the query term analyzer as two CJK bigrams.
Try "q=foo-bar" and you should then get comparable structure to the
generated queries.
-- Jack Krupansky
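The point of the reply above can be illustrated with a toy model: the parser splits the query on whitespace first, and only then is each chunk analyzed. The analyzer here is a rough stand-in (ASCII alphanumerics plus the main CJK ideograph block only), and `toy_analyze` and `clauses` are hypothetical names. It is a sketch of the structure, not of Solr's real analysis chain.

```python
import re

# Runs of CJK ideographs (main block only) or ASCII alphanumerics.
RUNS = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z0-9]+")


def toy_analyze(chunk):
    """Rough stand-in for the field analyzer: bigram CJK runs, keep words."""
    tokens = []
    for run in RUNS.findall(chunk):
        if "\u4e00" <= run[0] <= "\u9fff" and len(run) > 1:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            tokens.append(run.lower())
    return tokens


def clauses(user_query):
    """One clause per whitespace-separated chunk, holding its analyzed tokens."""
    return [toy_analyze(chunk) for chunk in user_query.split()]


print(clauses("foo bar"))   # → [['foo'], ['bar']]
print(clauses("旧小说"))     # → [['旧小', '小说']]
print(clauses("foo-bar"))   # → [['foo', 'bar']]
```

In this model "foo bar" yields two clauses of one token each, while "旧小说" and "foo-bar" each yield one clause holding two tokens, which matches the nesting seen in the parsed queries.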
Re: mm, tie, qs, ps and CJKBigramFilter and edismax and dismax
Posted by Naomi Dushay <nd...@stanford.edu>.
Regarding the edismax relevancy differences noted in my original message, some issues have already been filed.
Pertaining to the difference in how the phrase queries are merged into the main query, see Michael Dodsworth's comment of 25/Sep/12 on https://issues.apache.org/jira/browse/SOLR-2058 (the ticket is closed, but this issue is not addressed).
Pertaining to skipping terms in phrase boosting when part of the query is a phrase:
https://issues.apache.org/jira/browse/SOLR-4130
- Naomi