Posted to solr-user@lucene.apache.org by Jonathan Rochkind <ro...@jhu.edu> on 2011/06/14 23:19:10 UTC

ampersand, dismax, combining two fields, one of which is keywordTokenizer

I'm aware that using a field tokenized with KeywordTokenizerFactory in 
a dismax 'qf' is often going to result in 0 hits on that field 
(when a whitespace-containing query is entered).  But I do it anyway, 
for cases where a non-whitespace-containing query is entered, then it 
hits.  And in those cases where it doesn't hit, I figure okay, well, the 
other fields in qf will hit or not, that's good enough.

And usually that works. But it works _differently_ when my query 
contains an ampersand (or any other punctuation), resulting in 0 hits 
when it shouldn't, and I can't figure out why.

basically,

&defType=dismax&mm=100%&q=one : two&qf=text_field

gets hits.  The ":" is thrown out of the text_field, but the mm still 
passes somehow, right?

But, in the same index:

&defType=dismax&mm=100%&q=one : two&qf=text_field 
keyword_tokenized_text_field

gets 0 hits.  Somehow maybe the inclusion of the 
keyword_tokenized_text_field in the qf causes dismax to calculate the mm 
differently, decide there are three tokens in there and they all must 
match, and the token ":" can never match because it's not in my index, 
it's stripped out... but somehow this isn't a problem unless I include a 
keyword-tokenized field in the qf?

This is really confusing; if anyone has any idea what I'm talking about 
and can shed any light on it, much appreciated.

The conclusion I am reaching is just NEVER include anything but a more 
or less ordinarily tokenized field in a dismax qf. Sadly, it was useful 
for certain use cases for me.

Oh, hey, the debugging trace would probably be useful:


<lst name="debug">
<str name="rawquerystring">
churchill : roosevelt
</str>
<str name="querystring">
churchill : roosevelt
</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01) 
DisjunctionMaxQuery((isbn_t::)~0.01) 
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) 
DisjunctionMaxQuery((title2_unstem:"churchill roosevelt"~3^240.0 | 
text:"churchil roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 
| author_unstem:"churchill roosevelt"~3^400.0 | 
title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil 
roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 | 
author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill 
roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 | 
other_number_unstem:"churchill roosevelt"~3^40.0 | 
subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil 
roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 | 
text_unstem:"churchill roosevelt"~3^80.0)~0.01)
</str>
<str name="parsedquery_toString">
+(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01 
(isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) 
(title2_unstem:"churchill roosevelt"~3^240.0 | text:"churchil 
roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 | 
author_unstem:"churchill roosevelt"~3^400.0 | title_exactmatch:churchill 
roosevelt^500.0 | title1_t:"churchil roosevelt"~3^60.0 | 
title1_unstem:"churchill roosevelt"~3^320.0 | author2_unstem:"churchill 
roosevelt"~3^240.0 | title3_unstem:"churchill roosevelt"~3^80.0 | 
subject_t:"churchil roosevelt"~3^10.0 | other_number_unstem:"churchill 
roosevelt"~3^40.0 | subject_unstem:"churchill roosevelt"~3^80.0 | 
title_series_t:"churchil roosevelt"~3^40.0 | 
title_series_unstem:"churchill roosevelt"~3^60.0 | 
text_unstem:"churchill roosevelt"~3^80.0)~0.01
</str>
<lst name="explain"/>
<str name="QParser">
DisMaxQParser
</str>
<null name="altquerystring"/>
<null name="boostfuncs"/>
<lst name="timing">
<double name="time">
6.0
</double>
<lst name="prepare">
<double name="time">
3.0
</double>
<lst name="org.apache.solr.handler.component.QueryComponent">
<double name="time">
2.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.FacetComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.HighlightComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.StatsComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.SpellCheckComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.DebugComponent">
<double name="time">
0.0
</double>
</lst>
</lst>




Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Chris Hostetter <ho...@fucit.org>.
: Maybe what I really need is a query parser that does not do "disjunction
: maximum" at all, but somehow still combines different 'qf' type fields with
: different boosts on each field. I personally don't _necessarily_ need the
: actual "disjunction max" calculation, but I do need combining of multiple
: fields with different boosts. Of course, I'm not sure exactly how it would
: combine multiple fields if not "disjunction maximum", but perhaps one is
: conceivable that wouldn't be subject to this particular gotcha with differing
: analysis.

you can sort of do that today, something like this should work...

 q  = _query_:"$q1"^100 _query_:"$q2"^10 _query_:"$q3"^5 _query_:"$q4"
 q1 = {!lucene df=title v=$qq}
 q2 = {!lucene df=summary v=$qq}
 q3 = {!lucene df=author v=$qq}
 q4 = {!lucene df=body v=$qq}
 qq = ...user input here...

..but you might want to replace "lucene" with "field" depending on what 
metacharacters you want to support.
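(For reference, a client could assemble those parameters like any other Solr request params; a minimal Python sketch, where the sample user input is a placeholder and the field names just follow the example above:)

```python
from urllib.parse import urlencode

# Nested-query params from the example above: q references $q1..$q4, each of
# which re-parses the raw user input ($qq) against one field with its own boost.
params = {
    "q": '_query_:"$q1"^100 _query_:"$q2"^10 _query_:"$q3"^5 _query_:"$q4"',
    "q1": "{!lucene df=title v=$qq}",
    "q2": "{!lucene df=summary v=$qq}",
    "q3": "{!lucene df=author v=$qq}",
    "q4": "{!lucene df=body v=$qq}",
    "qq": "churchill roosevelt",   # ...user input here...
}
query_string = urlencode(params)   # append to your select handler's URL
```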

in general though, the reason i wrote the dismax parser (instead of a
parser that works like this) is because of how multiword queries wind up 
matching/scoring.  A guy named Chuck Williams wrote the earliest 
version of the DisjunctionMaxQuery class and his "albino elephant" 
example totally sold me on this approach back in 2005...

http://www.lucidimagination.com/search/document/8ce795c4b6752a1f/contribution_better_multi_field_searching
https://issues.apache.org/jira/browse/LUCENE-323

: I also remain kind of confused about how the existing dismax figures out "how
: many terms" for the 'mm' type calculations. If someone wanted to explain that,
: I would find it enlightening and helpful for understanding what's going on.

it's not really about terms -- it's just the total number of clauses in 
the outer BooleanQuery that it builds.  if a chunk of input produces a 
valid DisjunctionMaxQuery (because the analyzer for at least one qf field 
generated tokens) then that's a clause; if a chunk of input doesn't 
produce a token (because none of the analyzers from any of the qf fields 
generated tokens) then that's not a clause.
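(That counting rule can be modeled in a few lines. This is an illustrative sketch only, not Solr's actual code, with toy "analyzers" standing in for real field analysis chains:)

```python
import re

# Toy analyzers: one strips punctuation (so ":" yields no token at all),
# the other keeps every whitespace-separated chunk as-is.
def stripping_analyzer(chunk):
    token = re.sub(r"[^\w]", "", chunk)
    return [token] if token else []

def keeping_analyzer(chunk):
    return [chunk]

def count_dismax_clauses(query, analyzers):
    """A chunk of input contributes a clause iff at least one qf field's
    analyzer produces a token for it (the rule described above)."""
    return sum(
        1 for chunk in query.split()
        if any(analyzer(chunk) for analyzer in analyzers)
    )

# With only the stripping field, ":" vanishes entirely: 2 clauses,
# so mm=100% requires 2 matches.
two = count_dismax_clauses("one : two", [stripping_analyzer])
# Add a field that keeps ":" and it becomes a third clause,
# so mm=100% now requires 3 matches -- including the unmatchable ":".
three = count_dismax_clauses("one : two", [stripping_analyzer, keeping_analyzer])
```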


-Hoss

Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Yeah, I see your points. It's complicated. I'm not sure either.

But the thing is:

 > in order to use a feature like that you'd have to really think hard 
about
 > the query analysis of your fields, and which ones will produce which
 > tokens in which situations

You need to think really hard about the (index and query) analysis of 
your fields and which ones will produce which tokens _now_, if you are 
using multiple fields in a 'qf' with differing analysis, and using a 
percent mm. (Or similarly an mm that varies depending on how many terms).

That's what I've come to realize; that's the status quo. If your qf 
fields don't all have identical analysis, right _now_ you need to think 
really hard about the analysis and how it may affect 'mm', including 
for edge-case queries.  If you don't, you likely have edge-case queries 
(at least) that aren't behaving how you expected (whether you notice, 
or have it brought to your attention by users, or not).

Or you can just make sure all fields in your qf have identical analysis, 
and then you don't have to worry about it. But that's not always 
practical, a lot of the power of dismax qf ends up being combining 
fields with different analysis.

So I was trying to think of a way to make this less so, but still be 
able to take advantage of dismax, but I think you're right that maybe 
there isn't any, or at least nothing we've come up with yet.

Maybe what I really need is a query parser that does not do "disjunction 
maximum" at all, but somehow still combines different 'qf' type fields 
with different boosts on each field. I personally don't _necessarily_ 
need the actual "disjunction max" calculation, but I do need combining 
of multiple fields with different boosts. Of course, I'm not sure exactly 
how it would combine multiple fields if not "disjunction maximum", but 
perhaps one is conceivable that wouldn't be subject to this particular 
gotcha with differing analysis.

I also remain kind of confused about how the existing dismax figures out 
"how many terms" for the 'mm' type calculations. If someone wanted to 
explain that,  I would find it enlightening and helpful for 
understanding what's going on.

Jonathan

On 6/21/2011 10:20 PM, Chris Hostetter wrote:
> : not other) setups/intentions.  It's counter-intuitive to me that adding
> : a field to the 'qf' set results in _fewer_ hits than the same 'qf' set
>
> agreed .. but that's where looking at the debug info comes in: the reason
> for that behavior is that your old qf treated part of your input as
> garbage, and the new field respects it and uses it in the calculation.
>
> mind you: the "fewer hits" behavior only happens when using a percentage
> value in mm ... if you had mm=2 you'd get more results, but you've asked
> for "66%" (or whatever) and with that new qf there is a different number
> of clauses produced by query parsing.
>
> : I wonder if it would be a good idea to have a parameter to (e)dismax
> : that told it which of these two behaviors to use? The one where the
> : 'term count' is based on the maximum number of terms from any field in
> : the 'qf', and one where it's based on the minimum number of terms
> : produced from any field in the qf?  I am still not sure how feasible
>
> even in your use case, i don't think you are fully considering what that
> would produce.  imagine that an mmType=min param existed and gave you what
> you're asking for.  Now imagine that you have two fields, one named
> "simple" that strips all punctuation and one named "complex" that doesn't,
> and you have a query like this...
>
> 	q=Foo & Bar
> 	qf=simple complex
> 	mm=100%
> 	mmType=min
>
>    * Foo produces tokens for all qf
>    * & only produces tokens for some qf (complex)
>    * Bar produces tokens for all qf
>
> your mmType would say "there are only 2 tokens that we can query across
> all fields, so our computed minShouldMatch should be 100% of 2 == 2"
>
> sounds good so far right?
>
> the problem is you still have a query clause coming from that "&"
> character ... you have 3 real clauses, one of which is that term query for
> "complex:&" which means that with your (computed) minShouldMatch of 2 you
> would see matches for any doc that happened to have indexed the "&" symbol
> in the "complex" field and also matched *either* of Foo or Bar (in either
> field)
>
> So while a lot of your results would match both Foo and Bar, you'd
> still get a bunch of weird results.
>
> : Or maybe a feature where you tell dismax, the number of tokens produced
> : by field X, THAT's the one you should use for your 'term count' for mm,
>
> Hmmm.... maybe.  i'd have to see a patch in action and play with it, to
> really think it through ... hmmm ... honestly i really can't imagine how
> that would be helpful in general...
>
> in order to use a feature like that you'd have to really think hard about
> the query analysis of your fields, and which ones will produce which
> tokens in which situations in order to make sure you pick the *right*
> value for that param -- but once you've done that hard thinking you might
> as well feed it back into your schema.xml and say "the query analyzer for
> field 'complex' should prune any tokens that only contain punctuation"
> (instead of saying "'complex' will produce tokens that only contain
> punctuation, so let's tell dismax to compute mm based only on 'simple').
> After all, there might not be one single field that you can pick -- maybe
> 'complex' lets tokens that are all punctuation through but strips
> stopwords, and maybe 'simple' does the opposite ... no param value you
> pick will help you with that possibility, you really just need to fix the
> query analyzers to make sense if you want to use both of those two fields
> in the qf.
>
>
> -Hoss
>

RE: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Chris Hostetter <ho...@fucit.org>.
: not other) setups/intentions.  It's counter-intuitive to me that adding 
: a field to the 'qf' set results in _fewer_ hits than the same 'qf' set 

agreed .. but that's where looking at the debug info comes in: the reason 
for that behavior is that your old qf treated part of your input as 
garbage, and the new field respects it and uses it in the 
calculation.

mind you: the "fewer hits" behavior only happens when using a percentage 
value in mm ... if you had mm=2 you'd get more results, but you've asked 
for "66%" (or whatever) and with that new qf there is a different number 
of clauses produced by query parsing.

: I wonder if it would be a good idea to have a parameter to (e)dismax 
: that told it which of these two behaviors to use? The one where the 
: 'term count' is based on the maximum number of terms from any field in 
: the 'qf', and one where it's based on the minimum number of terms 
: produced from any field in the qf?  I am still not sure how feasible 

even in your use case, i don't think you are fully considering what that 
would produce.  imagine that an mmType=min param existed and gave you what 
you're asking for.  Now imagine that you have two fields, one named 
"simple" that strips all punctuation and one named "complex" that doesn't, 
and you have a query like this...

	q=Foo & Bar
	qf=simple complex
	mm=100%
	mmType=min

  * Foo produces tokens for all qf
  * & only produces tokens for some qf (complex)
  * Bar produces tokens for all qf

your mmType would say "there are only 2 tokens that we can query across 
all fields, so our computed minShouldMatch should be 100% of 2 == 2"

sounds good so far right?

the problem is you still have a query clause coming from that "&" 
character ... you have 3 real clauses, one of which is that term query for 
"complex:&" which means that with your (computed) minShouldMatch of 2 you 
would see matches for any doc that happened to have indexed the "&" symbol 
in the "complex" field and also matched *either* of Foo or Bar (in either 
field)

So while a lot of your results would match both Foo and Bar, you'd 
still get a bunch of weird results.
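(That failure mode can be simulated with a toy sketch -- again illustrative only, not Solr code. Each "clause" just checks whether a term occurs in a doc's field, mirroring the three clauses above:)

```python
# Three clauses survive parsing: Foo and Bar (against both fields) and
# the "&" term query, which exists only against the "complex" field.
clauses = [
    lambda doc: "foo" in doc["simple"] or "foo" in doc["complex"],
    lambda doc: "&" in doc["complex"],
    lambda doc: "bar" in doc["simple"] or "bar" in doc["complex"],
]

def matches(doc, min_should_match):
    # A doc matches if at least min_should_match clauses match it.
    return sum(clause(doc) for clause in clauses) >= min_should_match

good = {"simple": ["foo", "bar"], "complex": ["foo", "bar"]}
# Indexed an "&" in "complex", plus Foo -- but no Bar anywhere.
weird = {"simple": ["foo"], "complex": ["foo", "&"]}

# The hypothetical min-based count would set minShouldMatch to 2 of the
# 3 real clauses, so the "weird" doc slips through alongside the good one.
good_hit = matches(good, 2)
weird_hit = matches(weird, 2)
```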

: Or maybe a feature where you tell dismax, the number of tokens produced 
: by field X, THAT's the one you should use for your 'term count' for mm, 

Hmmm.... maybe.  i'd have to see a patch in action and play with it, to 
really think it through ... hmmm ... honestly i really can't imagine how 
that would be helpful in general...

in order to use a feature like that you'd have to really think hard about 
the query analysis of your fields, and which ones will produce which 
tokens in which situations in order to make sure you pick the *right* 
value for that param -- but once you've done that hard thinking you might 
as well feed it back into your schema.xml and say "the query analyzer for 
field 'complex' should prune any tokens that only contain punctuation" 
(instead of saying "'complex' will produce tokens that only contain 
punctuation, so let's tell dismax to compute mm based only on 'simple').  
After all, there might not be one single field that you can pick -- maybe 
'complex' lets tokens that are all punctuation through but strips 
stopwords, and maybe 'simple' does the opposite ... no param value you 
pick will help you with that possibility, you really just need to fix the 
query analyzers to make sense if you want to use both of those two fields 
in the qf.


-Hoss

RE: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Thanks, that's helpful. 

It still seems like current behavior does the "wrong" thing in _many_ cases (I know a lot of people get tripped up by it, sometimes on this list) -- but I understand your cases where it does the right thing, and where what I'm suggesting would be the wrong thing. 

> Ultimately the problem you had with "&" is the same problem people have 
> with stopwords, and comes down to the same thing: if you don't want some 
> chunk of text to be "significant" when searching a field in your qf, have 
> your analyzer remove it 

Ah, but see the problem people have with stopwords is when they actually DID that. They didn't want a term to be 'significant' in one field, but they DID want it to be 'significant' in another field... but how this affects the 'mm' ends up being kind of counter-intuitive for some (but not other) setups/intentions.  It's counter-intuitive to me that adding a field to the 'qf' set results in _fewer_ hits than the same 'qf' set without the new field -- although I understand your cases where you added the field to the 'qf' precisely in order to intentionally get that behavior, that's definitely not a universal case. 

And the fact that unpredictable changes to field analysis that aren't as simple as stopwords can lead to this same problem (as in this case where one field ignores punctuation and the other doesn't) -- it's definitely a trap waiting for some people. 

I wonder if it would be a good idea to have a parameter to (e)dismax that told it which of these two behaviors to use? The one where the 'term count' is based on the maximum number of terms from any field in the 'qf', and one where it's based on the minimum number of terms produced from any field in the qf?  I am still not sure how feasible THAT is, but it seems like a good idea to me. The current behavior is definitely a pitfall for many people.  

Or maybe a feature where you tell dismax, the number of tokens produced by field X, THAT's the one you should use for your 'term count' for mm, all the other fields are really just in there as sort of supplementary -- for boosting, or for bringing a few more results in; but NOT the case where you intentionally add a 'qf' with KeepWordsFilter in order to intentionally _reduce_ the result set . I think that's a pretty common use case too. 

Jonathan

Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Chris Hostetter <ho...@fucit.org>.
: It seems like the problem is when different fields in the 'qf' produce a
: different number of tokens for a given query.  dismax needs to know the number
: of tokens in the input in order to calculate 'mm', when 'mm' is expressed as a
: percentage, or when different mm's are given for different numbers of input
: tokens.

actually the fundamental problem is that when this situation arises, 
dismax has no way of knowing *if* you want the token that only produced a 
TermQuery in fieldA but not fieldB to be counted at all.

In your case, you don't want the "&" query against your simple (non 
whitespace-stripping) field to count in computing minShouldMatch, but how 
does dismax know that?

if someone has a field that not only strips out punctuation, but also 
ignores anything that doesn't match one of my known keywords (using the 
KeepWordsFilter) they would want the exact opposite situation as you -- they 
are really counting on the cases where a token produces a valid query for 
that special field to be a factor, and don't want the number of clauses used 
to compute minShouldMatch to be lowered artificially just because all the 
other tokens in the input don't produce anything for that field.

bottom line: as long as one field produces a token for a chunk of input, 
that's a clause -- it may only be a clause that's queried against one 
field, but it's still a clause.

: So what if dismax could recognize that different fields were producing
: different arity of tokens, and use the _smallest_ number for its 'mm'
: calculations, instead of current behavior where it's effectively the largest
: number? (Or '1' if the smallest number is '0'?!) That would in some cases
: produce errors in the other direction -- more hits coming back than you
: naively/intuitively expect.  Not sure if that would be worse or better. Seems
: better to me, less bad failure mode.

consider my previous example, and something similar to Jira searching 
where you might have a "projectCode" field with a query time 
KeepWordsFilter that only matches project codes ... right now, a query 
like q=SOLR+foo+bar+baz&mm=100%&qf=projectCode^100+text would give you 
some really nice results that match all the input, but if SOLR is a 
projectCode those issues bubble to the top -- with your proposal, the 
effective mm would be "1" (because the projectCode field would only wind 
up with the SOLR clause) and you'd get all sorts of crap -- because those 
other clauses are all still there.  so you'd get *all* projectCode:SOLR 
issues, and *all* issues matching text:foo, and *all* issues matching 
text:bar etc...

: Or better yet, but surely harder perhaps infeasible to code, it would somehow
: apply the 'mm' differently to each field. Not even sure what that means

That's pretty much impossible.  the whole nature of the dismax style 
parser is that a DisjunctionMaxQuery is computed for each "word" of the 
q, across all "fields" in the qf -- it's those DisjunctionMaxQueries that 
are wrapped in a BooleanQuery with minShouldMatch set on it...

	http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/

...if you "flipped" that matrix along the diagonal to have a different mm 
per field, you'd lose the value of the field-specific boosts.


Ultimately the problem you had with "&" is the same problem people have 
with stopwords, and comes down to the same thing: if you don't want some 
chunk of text to be "significant" when searching a field in your qf, have 
your analyzer remove it -- if the analyzer for a field in the qf produces 
a token, dismax assumes it's significant to the query and factors into the 
mm and matching and scoring.


-Hoss

Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Thanks. I'm trying to think through if there's any hypothetical way for 
dismax to be improved to not be subject to this problem.  Now that it's 
clear that the problem isn't just with stopwords, and that in fact it's 
very hard to predict if you'll get the problem and under what input, 
when creating your schema and 'qf' list.... it seems a worse problem 
than it did when it was thought of as just stopwords-related.

Of course, I'm trying to think through this without actually 
understanding the dismax code at all, just based on what I know of how 
dismax works from black box observation.

It seems like the problem is when different fields in the 'qf' produce a 
different number of tokens for a given query.  dismax needs to know the 
number of tokens in the input in order to calculate 'mm', when 'mm' is 
expressed as a percentage, or when different mm's are given for 
different numbers of input tokens.

Somehow dismax gets at this number now, based on the actual field 
analysis, not just whitespace-splitting at the query parser level.  
Because if I issue query "roosevelt & churchill", and ALL the fields 
involved have analysis that turns this into just two tokens 
['roosevelt', 'churchill'], then dismax does the right thing, 
recognizing two terms in the input. The problem is when some of the 
fields produce two tokens from that input, and others produce three -- 
dismax, I think, then decides there are three terms in input, but in at 
least some fields those 'three' terms can't possibly all match.

So what if dismax could recognize that different fields were producing 
different arity of tokens, and use the _smallest_ number for its 'mm' 
calculations, instead of current behavior where it's effectively the 
largest number? (Or '1' if the smallest number is '0'?!) That would in 
some cases produce errors in the other direction -- more hits coming 
back than you naively/intuitively expect.  Not sure if that would be 
worse or better. Seems better to me, less bad failure mode.

Or better yet, but surely harder and perhaps infeasible to code, it would 
somehow apply the 'mm' differently to each field. Not even sure what 
that means exactly. But somehow an mm of 100% means two terms in the 
field that analyzes to two, OR three terms in the field that analyzes to 
three... man, that's a mess.  Okay, stick with the first idea.

But I've got no idea how feasible that is to code, and I personally have 
no time to figure out how to code it, and nobody else is likely to since 
this problem is unlikely to be a high priority for solr committers.... 
so, I dunno.

On 6/15/2011 3:46 PM, Erick Erickson wrote:
> Jonathan:
>
> Thanks for writing that up, you're right, it is arcane....
>
> I've starred this one!
>
> Erick
>
>> http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
>> http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
>>
>> So to understand, first familiarize yourself with that.
>>
>> However, none of the fields involved here had any stopwords at all, so at
>> first it wasn't obvious this was the problem. But having different
>> tokenization and other analysis between fields can result in exactly the
>> same problem, for certain queries.
>>
>> One field in the dismax qf used an analyzer that stripped punctuation. (I'm
>> actually not positive at this point _which_ analyzer in my chain was
>> stripping punctuation, I'm using a bunch including some custom ones, but I
>> was aware that punctuation was being stripped, this was intentional.)
>>
>> So "monkey's" turns into "monkey".  "monkey:" turns into "monkey".  So far
>> so good. But what happens if you have punctuation all by itself separated by
>> whitespace?  "Roosevelt & Churchill" turns into ['roosevelt', 'churchill'].
>>   That ampersand in the middle was stripped out, essentially _just as if_ it
>> were a stopword. Only two tokens result from that input.
>>
>> You can see where this is going -- another field involved in the dismax qf
>> did NOT strip out punctuation. So three tokens result from that input,
>> ['Roosevelt', '&', 'Churchill'].
>>
>> Now we have exactly the situation that gives rise to the dismax stopwords
>> mm-behaving-funny situation, it's exactly the same thing.
>>
>> Now I've fixed this for punctuation just by making those fields strip out
>> punctuation, by adding these analyzers to the bottom of those
>> previously-not-stripping-punctuation field definitions:
>>
>> <!-- strip punctuation, to avoid dismax stopwords-like mm bug -->
>> <filter class="solr.PatternReplaceFilterFactory"
>>                 pattern="([\p{Punct}])" replacement="" replace="all"
>>         />
>> <!-- if after stripping punc we have any 0-length tokens, make
>>               sure to eliminate them. We can use LengthFilter min=1 for that,
>>               we don't care about the max here, just a very large number. -->
>> <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>>
>>
>> And things are working how I expect again, at least for this punctuation
>> issue. But there may be other edge cases where differences in analysis
>> result in different number of tokens from different fields, which if they
>> are both included in a dismax qf, will have bad effects on 'mm'.
>>
>> The lesson I think, is that the only absolute safe way to use dismax 'mm',
>> is when all fields in the 'qf' have exactly the same analysis.  But
>> obviously that's not very practical, it destroys much of the power of
>> dismax. And some differences in analysis are certainly acceptable -- but
>> it's rather tricky to figure out if your differences in analysis are going
>> to be significant for this problem, under what input, and if so fix them. It
>> is not an easy thing to do.  So dismax definitely has this gotcha
>> potentially waiting for you, whenever mixing fields with different analysis
>> in a 'qf'.
>>
>>
>> On 6/14/2011 5:25 PM, Jonathan Rochkind wrote:
>>> Okay, let's try the debug trace again without a pf to be less confusing.
>>>
>>> One field in qf, that's ordinary text tokenized, and does get hits:
>>>
>>>
>>> q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=
>>>
>>> <str name="rawquerystring">churchill : roosevelt</str>
>>> <str name="querystring">churchill : roosevelt</str>
>>> <str name="parsedquery">
>>> +((DisjunctionMaxQuery((title1_t:churchil)~0.01)
>>> DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
>>> </str>
>>> <str name="parsedquery_toString">
>>> +(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
>>> </str>
>>>
>>> And that gets 25 hits. Now we add in a second field to the qf, this second
>>> field is also ordinarily tokenized. We expect no _fewer_ than 25 hits,
>>> adding another field into qf, right? And indeed it still results in exactly
>>> 25 hits (no additional hits from the additional qf field).
>>>
>>>
>>> ?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=
>>>
>>> <str name="parsedquery">
>>> +((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01)
>>> DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()
>>> </str>
>>> <str name="parsedquery_toString">
>>> +(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt |
>>> title1_t:roosevelt)~0.01)~2) ()
>>> </str>
>>>
>>>
>>>
>>> Okay, now we go back to just that first (ordinarily tokenized) field, but
>>> add a second field that uses KeywordTokenizerFactory.  We expect this not
>>> necessarily to ever match for a multi-word query, but we don't expect it to
>>> be fewer than 25 hits, the 25 hits from the first field in the qf should
>>> still be there, right? But it's not. What happened, why not?
>>>
>>>
>>> q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=
>>>
>>>
<str name="rawquerystring">churchill : roosevelt</str>
>>> <str name="querystring">churchill : roosevelt</str>
>>> <str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill |
>>> title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01)
>>> DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3)
>>> ()</str>
>>> <str name="parsedquery_toString">+(((isbn_t:churchill |
>>> title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt |
>>> title1_t:roosevelt)~0.01)~3) ()</str>
>>>
>>>
>>>
>>> On 6/14/2011 5:19 PM, Jonathan Rochkind wrote:
>>>> I'm aware that using a field tokenized with KeywordTokenizerFactory in
>>>> a dismax 'qf' is often going to result in 0 hits on that field (when a
>>>> whitespace-containing query is entered).  But I do it anyway, for cases
>>>> where a non-whitespace-containing query is entered, then it hits.  And in
>>>> those cases where it doesn't hit, I figure okay, well, the other fields in
>>>> qf will hit or not, that's good enough.
>>>>
>>>> And usually that works. But it works _differently_ when my query contains
>>>> an ampersand (or any other punctuation), resulting in 0 hits when it shouldn't,
>>>> and I can't figure out why.
>>>>
>>>> basically,
>>>>
>>>> &defType=dismax&mm=100%&q=one : two&qf=text_field
>>>>
>>>> gets hits.  The ":" is thrown out of the text_field, but the mm still passes
>>>> somehow, right?
>>>>
>>>> But, in the same index:
>>>>
>>>> &defType=dismax&mm=100%&q=one : two&qf=text_field
>>>> keyword_tokenized_text_field
>>>>
>>>> gets 0 hits.  Somehow maybe the inclusion of the
>>>> keyword_tokenized_text_field in the qf causes dismax to calculate the mm
>>>> differently, decide there are three tokens in there and they all must match,
>>>> and the token ":" can never match because it's not in my index it's stripped
>>>> out... but somehow this isn't a problem unless I include a keyword-tokenized
>>>>   field in the qf?
>>>>
>>>> This is really confusing, if anyone has any idea what I'm talking about
>>>> it and can shed any light on it, much appreciated.
>>>>
>>>> The conclusion I am reaching is just NEVER include anything but a more or
>>>> less ordinarily tokenized field in a dismax qf. Sadly, it was useful for
>>>> certain use cases for me.
>>>>
>>>> Oh, hey, the debugging trace woudl probably be useful:
>>>>
>>>>
>>>> <lstname="debug">
>>>> <strname="rawquerystring">
>>>> churchill : roosevelt
>>>> </str>
>>>> <strname="querystring">
>>>> churchill : roosevelt
>>>> </str>
>>>> <strname="parsedquery">
>>>> +((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01)
>>>> DisjunctionMaxQuery((isbn_t::)~0.01) DisjunctionMaxQuery((isbn_t:roosevelt |
>>>> title1_t:roosevelt)~0.01))~3) DisjunctionMaxQuery((title2_unstem:"churchill
>>>> roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil
>>>> roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 |
>>>> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil
>>>> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 |
>>>> author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill
>>>> roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 |
>>>> other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill
>>>> roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 |
>>>> title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill
>>>> roosevelt"~3^80.0)~0.01)
>>>> </str>
>>>> <strname="parsedquery_toString">
>>>> +(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01
>>>> (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) (title2_unstem:"churchill
>>>> roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil
>>>> roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 |
>>>> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil
>>>> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 |
>>>> author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill
>>>> roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 |
>>>> other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill
>>>> roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 |
>>>> title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill
>>>> roosevelt"~3^80.0)~0.01
>>>> </str>
>>>> <lstname="explain"/>
>>>> <strname="QParser">
>>>> DisMaxQParser
>>>> </str>
>>>> <nullname="altquerystring"/>
>>>> <nullname="boostfuncs"/>
>>>> <lstname="timing">
>>>> <doublename="time">
>>>> 6.0
>>>> </double>
>>>> <lstname="prepare">
>>>> <doublename="time">
>>>> 3.0
>>>> </double>
>>>> <lstname="org.apache.solr.handler.component.QueryComponent">
>>>> <doublename="time">
>>>> 2.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.FacetComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.MoreLikeThisComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.HighlightComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.StatsComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.SpellCheckComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.DebugComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> </lst>
>>>>
>>>>
>>>>

Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Erick Erickson <er...@gmail.com>.
Jonathan:

Thanks for writing that up, you're right, it is arcane....

I've starred this one!

Erick

> [quoted text trimmed; the quoted messages appear in full below]

Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Okay, I figured this one out -- I'm participating in a thread with 
myself here, but for the benefit of posterity, or in case anyone's 
interested, it's kind of interesting.

It's actually a variation of the known issue with dismax, mm, and fields 
with varying stopwords -- a pretty tricky problem with dismax, which it's 
now clear goes way beyond just stopwords.

http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/

So to understand, first familiarize yourself with that.

However, none of the fields involved here had any stopwords at all, so 
at first it wasn't obvious this was the problem. But having different 
tokenization and other analysis between fields can result in exactly the 
same problem, for certain queries.

One field in the dismax qf used an analyzer that stripped punctuation. 
(I'm actually not positive at this point _which_ analyzer in my chain 
was stripping punctuation, I'm using a bunch including some custom ones, 
but I was aware that punctuation was being stripped, this was intentional.)

So "monkey's" turns into "monkey".  "monkey:" turns into "monkey".  So 
far so good. But what happens if you have punctuation all by itself, 
separated by whitespace?  "Roosevelt & Churchill" turns into 
['roosevelt', 'churchill'].  That ampersand in the middle was stripped 
out, essentially _just as if_ it were a stopword. Only two tokens result 
from that input.

You can see where this is going -- another field involved in the dismax 
qf did NOT strip out punctuation. So three tokens result from that 
input, ['Roosevelt', '&', 'Churchill'].

Now we have exactly the situation that gives rise to the dismax stopwords 
mm-behaving-funny situation; it's exactly the same thing.
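
To make the mechanism concrete, here's a rough model of it in Python -- 
my own sketch, not Solr code, and the two toy analyzers are assumptions 
standing in for the real analysis chains. DisMaxQParser splits the query 
on whitespace, analyzes each piece against every qf field, and only the 
positions where at least one field yields a token become clauses that mm 
counts:

```python
import re

# Toy analyzers standing in for the field types discussed (assumptions,
# not the actual analysis chains from this schema):
def text_analyzer(tok):
    """Strips punctuation, like the title1_t chain: ':' yields no token."""
    t = re.sub(r"[^\w]", "", tok)
    return [t.lower()] if t else []

def keyword_analyzer(tok):
    """Keeps the raw token, like an isbn_t / KeywordTokenizer field."""
    return [tok.lower()]

def dismax_clauses(query, analyzers):
    """One DisjunctionMaxQuery clause per whitespace piece that survives
    analysis in at least one qf field; mm counts these clauses."""
    clauses = []
    for tok in query.split():
        alternatives = [t for analyze in analyzers for t in analyze(tok)]
        if alternatives:        # empty in *every* field -> no clause at all
            clauses.append(alternatives)
    return clauses

# qf=title1_t only: ':' vanishes everywhere, so mm=100% means "2 of 2"
print(len(dismax_clauses("churchill : roosevelt", [text_analyzer])))   # 2
# qf=title1_t isbn_t: ':' survives in one field, so mm=100% means "3 of 3"
print(len(dismax_clauses("churchill : roosevelt",
                         [text_analyzer, keyword_analyzer])))          # 3
```

With three clauses and mm=100%, the (isbn_t::)~0.01 clause from the 
debug trace becomes a required clause that no document can satisfy, 
which is why the 25 title1_t hits disappear.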

Now I've fixed this for punctuation just by making those fields strip 
out punctuation, by adding these filters to the end of those 
previously-not-stripping-punctuation field definitions:

<!-- strip punctuation, to avoid the dismax stopwords-like mm bug -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="([\p{Punct}])" replacement="" replace="all"
/>
<!-- if after stripping punctuation we have any 0-length tokens, make
     sure to eliminate them. We can use LengthFilter min=1 for that;
     we don't care about the max here, just a very large number. -->
<filter class="solr.LengthFilterFactory" min="1" max="100"/>
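
In case it helps anyone wiring this in, here's roughly where those 
filters would sit in a schema.xml fieldType. The tokenizer and lowercase 
filter here are placeholders for illustration, not the actual chain from 
my index:

```xml
<fieldType name="text_nopunct" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- strip punctuation so it tokenizes away consistently across qf fields -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([\p{Punct}])" replacement="" replace="all"/>
    <!-- drop any tokens the replacement emptied out -->
    <filter class="solr.LengthFilterFactory" min="1" max="100"/>
  </analyzer>
</fieldType>
```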


And things are working how I expect again, at least for this 
punctuation issue. But there may be other edge cases where differences 
in analysis result in a different number of tokens from different 
fields, which, if the fields are both included in a dismax qf, will have 
bad effects on 'mm'.

The lesson, I think, is that the only absolutely safe way to use dismax 
'mm' is when all fields in the 'qf' have exactly the same analysis.  
But obviously that's not very practical; it destroys much of the power 
of dismax. And some differences in analysis are certainly acceptable -- 
but it's rather tricky to figure out whether your differences in 
analysis are going to be significant for this problem, under what input, 
and if so to fix them. It is not an easy thing to do.  So dismax 
definitely has this gotcha potentially waiting for you whenever you mix 
fields with different analysis in a 'qf'.
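
One way to catch this ahead of time, at least approximately: run a 
battery of representative queries through a stand-in for each qf field's 
analysis and flag any position where the fields disagree about whether a 
token survives. A rough sketch in Python -- the toy analyzers are my own 
assumptions, not the real chains:

```python
import re

# Toy stand-ins for the field analysis chains (assumptions, not my schema):
def stripping_analyzer(tok):            # strips punctuation, like title1_t
    t = re.sub(r"[^\w]", "", tok)
    return [t.lower()] if t else []

def keyword_analyzer(tok):              # keeps everything, like isbn_t
    return [tok.lower()]

def mm_hazards(queries, analyzers):
    """Flag whitespace-split positions where some qf fields drop the token
    and others keep it -- exactly the condition that skews dismax mm."""
    bad = []
    for q in queries:
        for tok in q.split():
            counts = {name: len(an(tok)) for name, an in analyzers.items()}
            if 0 in counts.values() and any(counts.values()):
                bad.append((q, tok))
    return bad

fields = {"title1_t": stripping_analyzer, "isbn_t": keyword_analyzer}
print(mm_hazards(["churchill : roosevelt", "plain query"], fields))
# [('churchill : roosevelt', ':')]
```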


On 6/14/2011 5:25 PM, Jonathan Rochkind wrote:
> [quoted text trimmed; the quoted messages appear in full below]

Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Okay, let's try the debug trace again without a pf to be less confusing.

One field in qf, an ordinarily tokenized text field, and it does get hits:

q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=

<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((title1_t:churchil)~0.01) 
DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
</str>

And that gets 25 hits. Now we add a second field to the qf; this 
second field is also ordinarily tokenized. We expect no _fewer_ than 25 
hits when adding another field into qf, right? And indeed it still 
results in exactly 25 hits (no additional hits from the additional qf 
field).

?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=

<str name="parsedquery">
+((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01) 
DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt | 
title1_t:roosevelt)~0.01)~2) ()
</str>



Okay, now we go back to just that first (ordinarily tokenized) field, 
but add a second field that uses KeywordTokenizerFactory.  We don't 
necessarily expect this to ever match for a multi-word query, but we 
don't expect fewer than 25 hits; the 25 hits from the first field in 
the qf should still be there, right? But they're not. What happened, why not?

q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=


<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill | 
title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01) 
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) 
()</str>
<str name="parsedquery_toString">+(((isbn_t:churchill | 
title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt | 
title1_t:roosevelt)~0.01)~3) ()</str>



On 6/14/2011 5:19 PM, Jonathan Rochkind wrote:
> I'm aware that using a field tokenized with KeywordTokenizerFactory 
> in a dismax 'qf' is often going to result in 0 hits on that field 
> (when a whitespace-containing query is entered).  But I do it anyway, 
> for cases where a non-whitespace-containing query is entered; then it 
> hits.  And in those cases where it doesn't hit, I figure okay, well, 
> the other fields in qf will hit or not, that's good enough.
>
> And usually that works. But it works _differently_ when my query 
> contains an ampersand (or any other punctuation), resulting in 0 hits 
> when it shouldn't, and I can't figure out why.
>
> basically,
>
> &defType=dismax&mm=100%&q=one : two&qf=text_field
>
> gets hits.  The ":" is thrown out of the text_field, but the mm still 
> passes somehow, right?
>
> But, in the same index:
>
> &defType=dismax&mm=100%&q=one : two&qf=text_field 
> keyword_tokenized_text_field
>
> gets 0 hits.  Somehow maybe the inclusion of the 
> keyword_tokenized_text_field in the qf causes dismax to calculate the 
> mm differently, decide there are three tokens in there and they all 
> must match, and the token ":" can never match because it's not in my 
> index (it's stripped out)... but somehow this isn't a problem unless I 
> include a keyword-tokenized field in the qf?
>
> This is really confusing; if anyone has any idea what I'm talking 
> about and can shed any light on it, much appreciated.
>
> The conclusion I am reaching is just NEVER include anything but a more 
> or less ordinarily tokenized field in a dismax qf. Sadly, it was 
> useful for certain use cases for me.
>
> Oh, hey, the debugging trace would probably be useful:
>
>
> <lstname="debug">
> <strname="rawquerystring">
> churchill : roosevelt
> </str>
> <strname="querystring">
> churchill : roosevelt
> </str>
> <strname="parsedquery">
> +((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01) 
> DisjunctionMaxQuery((isbn_t::)~0.01) 
> DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) 
> DisjunctionMaxQuery((title2_unstem:"churchill roosevelt"~3^240.0 | 
> text:"churchil roosevelt"~3^10.0 | title2_t:"churchil 
> roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 | 
> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil 
> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 | 
> author2_unstem:"churchill roosevelt"~3^240.0 | 
> title3_unstem:"churchill roosevelt"~3^80.0 | subject_t:"churchil 
> roosevelt"~3^10.0 | other_number_unstem:"churchill roosevelt"~3^40.0 | 
> subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil 
> roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 | 
> text_unstem:"churchill roosevelt"~3^80.0)~0.01)
> </str>
> <strname="parsedquery_toString">
> +(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01 
> (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) 
> (title2_unstem:"churchill roosevelt"~3^240.0 | text:"churchil 
> roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 | 
> author_unstem:"churchill roosevelt"~3^400.0 | 
> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil 
> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 | 
> author2_unstem:"churchill roosevelt"~3^240.0 | 
> title3_unstem:"churchill roosevelt"~3^80.0 | subject_t:"churchil 
> roosevelt"~3^10.0 | other_number_unstem:"churchill roosevelt"~3^40.0 | 
> subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil 
> roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 | 
> text_unstem:"churchill roosevelt"~3^80.0)~0.01
> </str>
> <lstname="explain"/>
> <strname="QParser">
> DisMaxQParser
> </str>
> <nullname="altquerystring"/>
> <nullname="boostfuncs"/>
> <lstname="timing">
> <doublename="time">
> 6.0
> </double>
> <lstname="prepare">
> <doublename="time">
> 3.0
> </double>
> <lstname="org.apache.solr.handler.component.QueryComponent">
> <doublename="time">
> 2.0
> </double>
> </lst>
> <lstname="org.apache.solr.handler.component.FacetComponent">
> <doublename="time">
> 0.0
> </double>
> </lst>
> <lstname="org.apache.solr.handler.component.MoreLikeThisComponent">
> <doublename="time">
> 0.0
> </double>
> </lst>
> <lstname="org.apache.solr.handler.component.HighlightComponent">
> <doublename="time">
> 0.0
> </double>
> </lst>
> <lstname="org.apache.solr.handler.component.StatsComponent">
> <doublename="time">
> 0.0
> </double>
> </lst>
> <lstname="org.apache.solr.handler.component.SpellCheckComponent">
> <doublename="time">
> 0.0
> </double>
> </lst>
> <lstname="org.apache.solr.handler.component.DebugComponent">
> <doublename="time">
> 0.0
> </double>
> </lst>
> </lst>
>
>
>