Posted to solr-user@lucene.apache.org by Jonathan Rochkind <ro...@jhu.edu> on 2011/06/14 23:19:10 UTC

ampersand, dismax, combining two fields, one of which is keywordTokenizer

I'm aware that using a field tokenized with KeywordTokenizerFactory in 
a dismax 'qf' is often going to result in 0 hits on that field 
(when a whitespace-containing query is entered).  But I do it anyway, 
for cases where a non-whitespace-containing query is entered, then it 
hits.  And in those cases where it doesn't hit, I figure okay, well, the 
other fields in qf will hit or not, that's good enough.

And usually that works. But it works _differently_ when my query 
contains an ampersand (or any other punctuation), resulting in 0 hits 
when it shouldn't, and I can't figure out why.

basically,

&defType=dismax&mm=100%&q=one : two&qf=text_field

gets hits.  The ":" is thrown out of the text_field, but the mm still 
passes somehow, right?

But, in the same index:

&defType=dismax&mm=100%&q=one : two&qf=text_field 
keyword_tokenized_text_field

gets 0 hits.  Somehow maybe the inclusion of the 
keyword_tokenized_text_field in the qf causes dismax to calculate the mm 
differently, decide there are three tokens in there and they all must 
match, and the token ":" can never match because it's not in my index, 
it's stripped out... but somehow this isn't a problem unless I include a 
keyword-tokenized field in the qf?

This is really confusing; if anyone has any idea what I'm talking about 
and can shed any light on it, much appreciated.

The conclusion I am reaching is just NEVER include anything but a more 
or less ordinarily tokenized field in a dismax qf. Sadly, it was useful 
for certain use cases for me.

Oh, hey, the debugging trace would probably be useful:


<lst name="debug">
<str name="rawquerystring">
churchill : roosevelt
</str>
<str name="querystring">
churchill : roosevelt
</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01) 
DisjunctionMaxQuery((isbn_t::)~0.01) 
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) 
DisjunctionMaxQuery((title2_unstem:"churchill roosevelt"~3^240.0 | 
text:"churchil roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 
| author_unstem:"churchill roosevelt"~3^400.0 | 
title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil 
roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 | 
author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill 
roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 | 
other_number_unstem:"churchill roosevelt"~3^40.0 | 
subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil 
roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 | 
text_unstem:"churchill roosevelt"~3^80.0)~0.01)
</str>
<str name="parsedquery_toString">
+(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01 
(isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) 
(title2_unstem:"churchill roosevelt"~3^240.0 | text:"churchil 
roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 | 
author_unstem:"churchill roosevelt"~3^400.0 | title_exactmatch:churchill 
roosevelt^500.0 | title1_t:"churchil roosevelt"~3^60.0 | 
title1_unstem:"churchill roosevelt"~3^320.0 | author2_unstem:"churchill 
roosevelt"~3^240.0 | title3_unstem:"churchill roosevelt"~3^80.0 | 
subject_t:"churchil roosevelt"~3^10.0 | other_number_unstem:"churchill 
roosevelt"~3^40.0 | subject_unstem:"churchill roosevelt"~3^80.0 | 
title_series_t:"churchil roosevelt"~3^40.0 | 
title_series_unstem:"churchill roosevelt"~3^60.0 | 
text_unstem:"churchill roosevelt"~3^80.0)~0.01
</str>
<lst name="explain"/>
<str name="QParser">
DisMaxQParser
</str>
<null name="altquerystring"/>
<null name="boostfuncs"/>
<lst name="timing">
<double name="time">
6.0
</double>
<lst name="prepare">
<double name="time">
3.0
</double>
<lst name="org.apache.solr.handler.component.QueryComponent">
<double name="time">
2.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.FacetComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.HighlightComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.StatsComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.SpellCheckComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.DebugComponent">
<double name="time">
0.0
</double>
</lst>
</lst>




Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Chris Hostetter <ho...@fucit.org>.
: Maybe what I really need is a query parser that does not do "disjunction
: maximum" at all, but somehow still combines different 'qf' type fields with
: different boosts on each field. I personally don't _necessarily_ need the
: actual "disjunction max" calculation, but I do need combining of multiple
: fields with different boosts. Of course, I'm not sure exactly how it would
: combine multiple fields if not "disjunction maximum", but perhaps one is
: conceivable that wouldn't be subject to this particular gotcha with differing
: analysis.

you can sort of do that today, something like this should work...

 q  = _query_:"$q1"^100 _query_:"$q2"^10 _query_:"$q3"^5 _query_:"$q4"
 q1 = {!lucene df=title v=$qq}
 q2 = {!lucene df=summary v=$qq}
 q3 = {!lucene df=author v=$qq}
 q4 = {!lucene df=body v=$qq}
 qq = ...user input here...

..but you might want to replace "lucene" with "field" depending on what 
metacharacters you want to support.
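(For reference, a client could assemble those parameters like any other Solr request params; a minimal Python sketch, where the sample user input is a placeholder and the field names just follow the example above:)

```python
from urllib.parse import urlencode

# Nested-query params from the example above: q references $q1..$q4, each of
# which re-parses the raw user input ($qq) against one field with its own boost.
params = {
    "q": '_query_:"$q1"^100 _query_:"$q2"^10 _query_:"$q3"^5 _query_:"$q4"',
    "q1": "{!lucene df=title v=$qq}",
    "q2": "{!lucene df=summary v=$qq}",
    "q3": "{!lucene df=author v=$qq}",
    "q4": "{!lucene df=body v=$qq}",
    "qq": "churchill roosevelt",   # ...user input here...
}
query_string = urlencode(params)   # append to your select handler's URL
```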

in general though, the reason i wrote the dismax parser (instead of a
parser that works like this) is because of how multiword queries wind up 
matching/scoring.  A guy named Chuck Williams wrote the earliest 
version of the DisjunctionMaxQuery class and his "albino elephant" 
example totally sold me on this approach back in 2005...

http://www.lucidimagination.com/search/document/8ce795c4b6752a1f/contribution_better_multi_field_searching
https://issues.apache.org/jira/browse/LUCENE-323

: I also remain kind of confused about how the existing dismax figures out "how
: many terms" for the 'mm' type calculations. If someone wanted to explain that,
: I would find it enlightening and helpful for understanding what's going on.

it's not really about terms -- it's just the total number of clauses in 
the outer BooleanQuery that it builds.  if a chunk of input produces a 
valid DisjunctionMaxQuery (because the analyzer for at least one qf field 
generated tokens) then that's a clause; if a chunk of input doesn't 
produce a token (because none of the analyzers from any of the qf fields 
generated tokens) then that's not a clause.
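(That counting rule can be modeled in a few lines. This is an illustrative sketch only, not Solr's actual code, with toy "analyzers" standing in for real field analysis chains:)

```python
import re

# Toy analyzers: one strips punctuation (so ":" yields no token at all),
# the other keeps every whitespace-separated chunk as-is.
def stripping_analyzer(chunk):
    token = re.sub(r"[^\w]", "", chunk)
    return [token] if token else []

def keeping_analyzer(chunk):
    return [chunk]

def count_dismax_clauses(query, analyzers):
    """A chunk of input contributes a clause iff at least one qf field's
    analyzer produces a token for it (the rule described above)."""
    return sum(
        1 for chunk in query.split()
        if any(analyzer(chunk) for analyzer in analyzers)
    )

# With only the stripping field, ":" vanishes entirely: 2 clauses,
# so mm=100% requires 2 matches.
two = count_dismax_clauses("one : two", [stripping_analyzer])
# Add a field that keeps ":" and it becomes a third clause,
# so mm=100% now requires 3 matches -- including the unmatchable ":".
three = count_dismax_clauses("one : two", [stripping_analyzer, keeping_analyzer])
```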


-Hoss

Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Yeah, I see your points. It's complicated. I'm not sure either.

But the thing is:

 > in order to use a feature like that you'd have to really think hard 
about
 > the query analysis of your fields, and which ones will produce which
 > tokens in which situations

You need to think really hard about the (index and query) analysis of 
your fields and which ones will produce which tokens _now_, if you are 
using multiple fields in a 'qf' with differing analysis, and using a 
percent mm. (Or similarly an mm that varies depending on how many terms).

That's what I've come to realize; that's the status quo. If your qf 
fields don't all have identical analysis, right _now_ you need to think 
really hard about the analysis and how it may affect 'mm', including 
for edge-case queries.  If you don't, you likely have edge-case queries 
(at least) that aren't behaving how you expected (whether you notice, 
or have it brought to your attention by users, or not).

Or you can just make sure all fields in your qf have identical analysis, 
and then you don't have to worry about it. But that's not always 
practical, a lot of the power of dismax qf ends up being combining 
fields with different analysis.

So I was trying to think of a way to make this less so, but still be 
able to take advantage of dismax, but I think you're right that maybe 
there isn't any, or at least nothing we've come up with yet.

Maybe what I really need is a query parser that does not do "disjunction 
maximum" at all, but somehow still combines different 'qf' type fields 
with different boosts on each field. I personally don't _necessarily_ 
need the actual "disjunction max" calculation, but I do need combining 
of multiple fields with different boosts. Of course, I'm not sure exactly 
how it would combine multiple fields if not "disjunction maximum", but 
perhaps one is conceivable that wouldn't be subject to this particular 
gotcha with differing analysis.

I also remain kind of confused about how the existing dismax figures out 
"how many terms" for the 'mm' type calculations. If someone wanted to 
explain that,  I would find it enlightening and helpful for 
understanding what's going on.

Jonathan

On 6/21/2011 10:20 PM, Chris Hostetter wrote:
> : not other) setups/intentions.  It's counter-intuitive to me that adding
> : a field to the 'qf' set results in _fewer_ hits than the same 'qf' set
>
> agreed .. but that's where looking at the debug info comes in: the reason
> for that behavior is that your old qf treated part of your input as
> garbage, and the new field respects it and uses it in the calculation.
>
> mind you: the "fewer hits" behavior only happens when using a percentage
> value in mm ... if you had mm=2 you'd get more results, but you've asked
> for "66%" (or whatever) and with that new qf there is a different number
> of clauses produced by query parsing.
>
> : I wonder if it would be a good idea to have a parameter to (e)dismax
> : that told it which of these two behaviors to use? The one where the
> : 'term count' is based on the maximum number of terms from any field in
> : the 'qf', and one where it's based on the minimum number of terms
> : produced from any field in the qf?  I am still not sure how feasible
>
> even in your use case, i don't think you are fully considering what that
> would produce.  imagine that an mmType=min param existed and gave you what
> you're asking for.  Now imagine that you have two fields, one named
> "simple" that strips all punctuation and one named "complex" that doesn't,
> and you have a query like this...
>
> 	q=Foo & Bar
> 	qf=simple complex
> 	mm=100%
> 	mmType=min
>
>    * Foo produces tokens for all qf
>    * & only produces tokens for some qf (complex)
>    * Bar produces tokens for all qf
>
> your mmType would say "there are only 2 tokens that we can query across
> all fields, so our computed minShouldMatch should be 100% of 2 == 2"
>
> sounds good so far right?
>
> the problem is you still have a query clause coming from that "&"
> character ... you have 3 real clauses, one of which is that term query for
> "complex:&" which means that with your (computed) minShouldMatch of 2 you
> would see matches for any doc that happened to have indexed the "&" symbol
> in the "complex" field and also matched *either* of Foo or Bar (in either
> field)
>
> So while a lot of your results would match both Foo and Bar, you'd
> still get a bunch of weird results.
>
> : Or maybe a feature where you tell dismax, the number of tokens produced
> : by field X, THAT's the one you should use for your 'term count' for mm,
>
> Hmmm.... maybe.  i'd have to see a patch in action and play with it, to
> really think it through ... hmmm ... honestly i really can't imagine how
> that would be helpful in general...
>
> in order to use a feature like that you'd have to really think hard about
> the query analysis of your fields, and which ones will produce which
> tokens in which situations in order to make sure you pick the *right*
> value for that param -- but once you've done that hard thinking you might
> as well feed it back into your schema.xml and say "the query analyzer for
> field 'complex' should prune any tokens that only contain punctuation"
> (instead of saying "'complex' will produce tokens that only contain
> punctuation, so let's tell dismax to compute mm based only on 'simple').
> After all, there might not be one single field that you can pick -- maybe
> 'complex' lets tokens that are all punctuation through but strips
> stopwords, and maybe 'simple' does the opposite ... no param value you
> pick will help you with that possibility, you really just need to fix the
> query analyzers to make sense if you want to use both of those two fields
> in the qf.
>
>
> -Hoss
>

RE: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Chris Hostetter <ho...@fucit.org>.
: not other) setups/intentions.  It's counter-intuitive to me that adding 
: a field to the 'qf' set results in _fewer_ hits than the same 'qf' set 

agreed .. but that's where looking at the debug info comes in: the reason 
for that behavior is that your old qf treated part of your input as 
garbage, and the new field respects it and uses it in the 
calculation.

mind you: the "fewer hits" behavior only happens when using a percentage 
value in mm ... if you had mm=2 you'd get more results, but you've asked 
for "66%" (or whatever) and with that new qf there is a different number 
of clauses produced by query parsing.

: I wonder if it would be a good idea to have a parameter to (e)dismax 
: that told it which of these two behaviors to use? The one where the 
: 'term count' is based on the maximum number of terms from any field in 
: the 'qf', and one where it's based on the minimum number of terms 
: produced from any field in the qf?  I am still not sure how feasible 

even in your use case, i don't think you are fully considering what that 
would produce.  imagine that an mmType=min param existed and gave you what 
you're asking for.  Now imagine that you have two fields, one named 
"simple" that strips all punctuation and one named "complex" that doesn't, 
and you have a query like this...

	q=Foo & Bar
	qf=simple complex
	mm=100%
	mmType=min

  * Foo produces tokens for all qf
  * & only produces tokens for some qf (complex)
  * Bar produces tokens for all qf

your mmType would say "there are only 2 tokens that we can query across 
all fields, so our computed minShouldMatch should be 100% of 2 == 2"

sounds good so far right?

the problem is you still have a query clause coming from that "&" 
character ... you have 3 real clauses, one of which is that term query for 
"complex:&" which means that with your (computed) minShouldMatch of 2 you 
would see matches for any doc that happened to have indexed the "&" symbol 
in the "complex" field and also matched *either* of Foo or Bar (in either 
field)

So while a lot of your results would match both Foo and Bar, you'd 
still get a bunch of weird results.
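(That failure mode can be simulated with a toy sketch -- again illustrative only, not Solr code. Each "clause" just checks whether a term occurs in a doc's field, mirroring the three clauses above:)

```python
# Three clauses survive parsing: Foo and Bar (against both fields) and
# the "&" term query, which exists only against the "complex" field.
clauses = [
    lambda doc: "foo" in doc["simple"] or "foo" in doc["complex"],
    lambda doc: "&" in doc["complex"],
    lambda doc: "bar" in doc["simple"] or "bar" in doc["complex"],
]

def matches(doc, min_should_match):
    # A doc matches if at least min_should_match clauses match it.
    return sum(clause(doc) for clause in clauses) >= min_should_match

good = {"simple": ["foo", "bar"], "complex": ["foo", "bar"]}
# Indexed an "&" in "complex", plus Foo -- but no Bar anywhere.
weird = {"simple": ["foo"], "complex": ["foo", "&"]}

# The hypothetical min-based count would set minShouldMatch to 2 of the
# 3 real clauses, so the "weird" doc slips through alongside the good one.
good_hit = matches(good, 2)
weird_hit = matches(weird, 2)
```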

: Or maybe a feature where you tell dismax, the number of tokens produced 
: by field X, THAT's the one you should use for your 'term count' for mm, 

Hmmm.... maybe.  i'd have to see a patch in action and play with it, to 
really think it through ... hmmm ... honestly i really can't imagine how 
that would be helpful in general...

in order to use a feature like that you'd have to really think hard about 
the query analysis of your fields, and which ones will produce which 
tokens in which situations in order to make sure you pick the *right* 
value for that param -- but once you've done that hard thinking you might 
as well feed it back into your schema.xml and say "the query analyzer for 
field 'complex' should prune any tokens that only contain punctuation" 
(instead of saying "'complex' will produce tokens that only contain 
punctuation, so let's tell dismax to compute mm based only on 'simple').  
After all, there might not be one single field that you can pick -- maybe 
'complex' lets tokens that are all punctuation through but strips 
stopwords, and maybe 'simple' does the opposite ... no param value you 
pick will help you with that possibility, you really just need to fix the 
query analyzers to make sense if you want to use both of those two fields 
in the qf.


-Hoss

RE: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Thanks, that's helpful. 

It still seems like current behavior does the "wrong" thing in _many_ cases (I know a lot of people get tripped up by it, sometimes on this list) -- but I understand your cases where it does the right thing, and where what I'm suggesting would be the wrong thing. 

> Ultimately the problem you had with "&" is the same problem people have 
> with stopwords, and comes down to the same thing: if you don't want some 
> chunk of text to be "significant" when searching a field in your qf, have 
> your analyzer remove it 

Ah, but see the problem people have with stopwords is when they actually DID that. They didn't want a term to be 'significant' in one field, but they DID want it to be 'significant' in another field... but how this affects the 'mm' ends up being kind of counter-intuitive for some (but not other) setups/intentions.  It's counter-intuitive to me that adding a field to the 'qf' set results in _fewer_ hits than the same 'qf' set without the new field -- although I understand your cases where you added the field to the 'qf' precisely in order to intentionally get that behavior, that's definitely not a universal case. 

And the fact that unpredictable changes to field analysis that aren't as simple as stopwords can lead to this same problem (as in this case where one field ignores punctuation and the other doesn't) -- it's definitely a trap waiting for some people. 

I wonder if it would be a good idea to have a parameter to (e)dismax that told it which of these two behaviors to use? The one where the 'term count' is based on the maximum number of terms from any field in the 'qf', and one where it's based on the minimum number of terms produced from any field in the qf?  I am still not sure how feasible THAT is, but it seems like a good idea to me. The current behavior is definitely a pitfall for many people.  

Or maybe a feature where you tell dismax, the number of tokens produced by field X, THAT's the one you should use for your 'term count' for mm, all the other fields are really just in there as sort of supplementary -- for boosting, or for bringing a few more results in; but NOT the case where you intentionally add a 'qf' with KeepWordsFilter in order to intentionally _reduce_ the result set . I think that's a pretty common use case too. 

Jonathan

Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Chris Hostetter <ho...@fucit.org>.
: It seems like the problem is when different fields in the 'qf' produce a
: different number of tokens for a given query.  dismax needs to know the number
: of tokens in the input in order to calculate 'mm', when 'mm' is expressed as a
: percentage, or when different mm's are given for different numbers of input
: tokens.

actually the fundamental problem is that when this situation arises, 
dismax has no way of knowing *if* you want the token that only produced a 
TermQuery in fieldA but not fieldB to be counted at all.

In your case, you don't want the "&" query against your simple (non 
whitespace-stripping) field to count in computing minShouldMatch, but how 
does dismax know that?

if someone has a field that not only strips out punctuation, but also 
ignores anything that doesn't match one of my known keywords (using the 
KeepWordsFilter) they would want the exact opposite situation as you -- they 
are really counting on the cases where a token produces a valid query for 
that special field to be a factor, and don't want the number of clauses used 
to compute minShouldMatch to be lowered artificially just because all the 
other tokens in the input don't produce anything for that field.

bottom line: as long as one field produces a token for a chunk of input, 
that's a clause -- it may only be a clause that's queried against one 
field, but it's still a clause.

: So what if dismax could recognize that different fields were producing
: different arity of tokens, and use the _smallest_ number for its 'mm'
: calculations, instead of current behavior where it's effectively the largest
: number? (Or '1' if the smallest number is '0'?!) That would in some cases
: produce errors in the other direction -- more hits coming back than you
: naively/intuitively expect.  Not sure if that would be worse or better. Seems
: better to me, less bad failure mode.

consider my previous example, and something similar to Jira searching 
where you might have a "projectCode" field with a query time 
KeepWordsFilter that only matches project codes ... right now, a query 
like q=SOLR+foo+bar+baz&mm=100%&qf=projectCode^100+text would give you 
some really nice results that match all the input, but if SOLR is a 
projectCode those issues bubble to the top -- with your proposal, the 
effective mm would be "1" (because the projectCode field would only wind 
up with the SOLR clause) and you'd get all sorts of crap -- because those 
other clauses are all still there.  so you'd get *all* projectCode:SOLR 
issues, and *all* issues matching text:foo, and *all* issues matching 
text:bar etc...

: Or better yet, but surely harder perhaps infeasible to code, it would somehow
: apply the 'mm' differently to each field. Not even sure what that means

That's pretty much impossible.  the whole nature of the dismax style 
parser is that a DisjunctionMaxQuery is computed for each "word" of the 
q, across all "fields" in the qf -- it's those DisjunctionMaxQueries that 
are wrapped in a BooleanQuery with minShouldMatch set on it...

	http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/

...if you "flipped" that matrix along the diagonal to have a different mm 
per field, you'd lose the value of the field-specific boosts.


Ultimately the problem you had with "&" is the same problem people have 
with stopwords, and comes down to the same thing: if you don't want some 
chunk of text to be "significant" when searching a field in your qf, have 
your analyzer remove it -- if the analyzer for a field in the qf produces 
a token, dismax assumes it's significant to the query and factors into the 
mm and matching and scoring.


-Hoss

Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Thanks. I'm trying to think through if there's any hypothetical way for 
dismax to be improved to not be subject to this problem.  Now that it's 
clear that the problem isn't just with stopwords, and that in fact it's 
very hard to predict if you'll get the problem and under what input, 
when creating your schema and 'qf' list.... it seems a worse problem 
than it did when it was thought of as just stopwords-related.

Of course, I'm trying to think through this without actually 
understanding the dismax code at all, just based on what I know of how 
dismax works from black box observation.

It seems like the problem is when different fields in the 'qf' produce a 
different number of tokens for a given query.  dismax needs to know the 
number of tokens in the input in order to calculate 'mm', when 'mm' is 
expressed as a percentage, or when different mm's are given for 
different numbers of input tokens.

Somehow dismax gets at this number now, based on the actual field 
analysis, not just whitespace-splitting at the query parser level.  
Because if I issue query "roosevelt & churchill", and ALL the fields 
involved have analysis that turns this into just two tokens 
['roosevelt', 'churchill'], then dismax does the right thing, 
recognizing two terms in the input. The problem is when some of the 
fields produce two tokens from that input, and others produce three -- 
dismax, I think, then decides there are three terms in input, but in at 
least some fields those 'three' terms can't possibly all match.

So what if dismax could recognize that different fields were producing 
different arity of tokens, and use the _smallest_ number for its 'mm' 
calculations, instead of current behavior where it's effectively the 
largest number? (Or '1' if the smallest number is '0'?!) That would in 
some cases produce errors in the other direction -- more hits coming 
back than you naively/intuitively expect.  Not sure if that would be 
worse or better. Seems better to me, less bad failure mode.

Or better yet, but surely harder and perhaps infeasible to code, it would 
somehow apply the 'mm' differently to each field. Not even sure what 
that means exactly. But somehow an mm of 100% means two terms in the 
field that analyzes to two, OR three terms in the field that analyzes to 
three... man, that's a mess.  Okay, stick with the first idea.

But I've got no idea how feasible that is to code, and I personally have 
no time to figure out how to code it, and nobody else is likely to since 
this problem is unlikely to be a high priority for solr committers.... 
so, I dunno.

On 6/15/2011 3:46 PM, Erick Erickson wrote:
> Jonathan:
>
> Thanks for writing that up, you're right, it is arcane....
>
> I've starred this one!
>
> Erick
>
>> http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
>> http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
>>
>> So to understand, first familiarize yourself with that.
>>
>> However, none of the fields involved here had any stopwords at all, so at
>> first it wasn't obvious this was the problem. But having different
>> tokenization and other analysis between fields can result in exactly the
>> same problem, for certain queries.
>>
>> One field in the dismax qf used an analyzer that stripped punctuation. (I'm
>> actually not positive at this point _which_ analyzer in my chain was
>> stripping punctuation, I'm using a bunch including some custom ones, but I
>> was aware that punctuation was being stripped, this was intentional.)
>>
>> So "monkey's" turns into "monkey".  "monkey:" turns into "monkey".  So far
>> so good. But what happens if you have punctuation all by itself separated by
>> whitespace?  "Roosevelt & Churchill" turns into ['roosevelt', 'churchill'].
>>   That ampersand in the middle was stripped out, essentially _just as if_ it
>> were a stopword. Only two tokens result from that input.
>>
>> You can see where this is going -- another field involved in the dismax qf
>> did NOT strip out punctuation. So three tokens result from that input,
>> ['Roosevelt', '&', 'Churchill'].
>>
>> Now we have exactly the situation that gives rise to the dismax stopwords
>> mm-behaving-funny situation, it's exactly the same thing.
>>
>> Now I've fixed this for punctuation just by making those fields strip out
>> punctuation, by adding these analyzers to the bottom of those
>> previously-not-stripping-punctuation field definitions:
>>
>> <!-- strip punctuation, to avoid dismax stopwords-like mm bug -->
>> <filter class="solr.PatternReplaceFilterFactory"
>>                 pattern="([\p{Punct}])" replacement="" replace="all"
>>         />
>> <!-- if after stripping punc we have any 0-length tokens, make
>>               sure to eliminate them. We can use LengthFilter min=1 for that,
>>               we don't care about the max here, just a very large number. -->
>> <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>>
>>
>> And things are working how I expect again, at least for this punctuation
>> issue. But there may be other edge cases where differences in analysis
>> result in different number of tokens from different fields, which if they
>> are both included in a dismax qf, will have bad effects on 'mm'.
>>
>> The lesson I think, is that the only absolute safe way to use dismax 'mm',
>> is when all fields in the 'qf' have exactly the same analysis.  But
>> obviously that's not very practical, it destroys much of the power of
>> dismax. And some differences in analysis are certainly acceptable -- but
>> it's rather tricky to figure out if your differences in analysis are going
>> to be significant for this problem, under what input, and if so fix them. It
>> is not an easy thing to do.  So dismax definitely has this gotcha
>> potentially waiting for you, whenever mixing fields with different analysis
>> in a 'qf'.
>>
>>
>> On 6/14/2011 5:25 PM, Jonathan Rochkind wrote:
>>> Okay, let's try the debug trace again without a pf to be less confusing.
>>>
>>> One field in qf, that's ordinary text tokenized, and does get hits:
>>>
>>>
>>> q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=
>>>
>>> <str name="rawquerystring">churchill : roosevelt</str>
>>> <str name="querystring">churchill : roosevelt</str>
>>> <str name="parsedquery">
>>> +((DisjunctionMaxQuery((title1_t:churchil)~0.01)
>>> DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
>>> </str>
>>> <str name="parsedquery_toString">
>>> +(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
>>> </str>
>>>
>>> And that gets 25 hits. Now we add in a second field to the qf, this second
>>> field is also ordinarily tokenized. We expect no _fewer_ than 25 hits,
>>> adding another field into qf, right? And indeed it still results in exactly
>>> 25 hits (no additional hits from the additional qf field).
>>>
>>>
>>> ?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=
>>>
>>> <str name="parsedquery">
>>> +((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01)
>>> DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()
>>> </str>
>>> <str name="parsedquery_toString">
>>> +(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt |
>>> title1_t:roosevelt)~0.01)~2) ()
>>> </str>
>>>
>>>
>>>
>>> Okay, now we go back to just that first (ordinarily tokenized) field, but
>>> add a second field that uses KeywordTokenizerFactory.  We expect this not
>>> necessarily to ever match for a multi-word query, but we don't expect it to
>>> be fewer than 25 hits, the 25 hits from the first field in the qf should
>>> still be there, right? But it's not. What happened, why not?
>>>
>>>
>>> q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=
>>>
>>>
<str name="rawquerystring">churchill : roosevelt</str>
>>> <str name="querystring">churchill : roosevelt</str>
>>> <str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill |
>>> title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01)
>>> DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3)
>>> ()</str>
>>> <str name="parsedquery_toString">+(((isbn_t:churchill |
>>> title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt |
>>> title1_t:roosevelt)~0.01)~3) ()</str>
>>>
>>>
>>>
>>> On 6/14/2011 5:19 PM, Jonathan Rochkind wrote:
>>>> I'm aware that using a field tokenized with KeywordTokenizerFactory in
>>>> a dismax 'qf' is often going to result in 0 hits on that field (when a
>>>> whitespace-containing query is entered).  But I do it anyway, for cases
>>>> where a non-whitespace-containing query is entered, then it hits.  And in
>>>> those cases where it doesn't hit, I figure okay, well, the other fields in
>>>> qf will hit or not, that's good enough.
>>>>
>>>> And usually that works. But it works _differently_ when my query contains
>>>> an ampersand (or any other punctuation), resulting in 0 hits when it shouldn't,
>>>> and I can't figure out why.
>>>>
>>>> basically,
>>>>
>>>> &defType=dismax&mm=100%&q=one : two&qf=text_field
>>>>
>>>> gets hits.  The ":" is thrown out of the text_field, but the mm still passes
>>>> somehow, right?
>>>>
>>>> But, in the same index:
>>>>
>>>> &defType=dismax&mm=100%&q=one : two&qf=text_field
>>>> keyword_tokenized_text_field
>>>>
>>>> gets 0 hits.  Somehow maybe the inclusion of the
>>>> keyword_tokenized_text_field in the qf causes dismax to calculate the mm
>>>> differently, decide there are three tokens in there and they all must match,
>>>> and the token ":" can never match because it's not in my index it's stripped
>>>> out... but somehow this isn't a problem unless I include a keyword-tokenized
>>>>   field in the qf?
>>>>
>>>> This is really confusing, if anyone has any idea what I'm talking about
>>>> it and can shed any light on it, much appreciated.
>>>>
>>>> The conclusion I am reaching is just NEVER include anything but a more or
>>>> less ordinarily tokenized field in a dismax qf. Sadly, it was useful for
>>>> certain use cases for me.
>>>>
>>>> Oh, hey, the debugging trace woudl probably be useful:
>>>>
>>>>
>>>> <lstname="debug">
>>>> <strname="rawquerystring">
>>>> churchill : roosevelt
>>>> </str>
>>>> <strname="querystring">
>>>> churchill : roosevelt
>>>> </str>
>>>> <strname="parsedquery">
>>>> +((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01)
>>>> DisjunctionMaxQuery((isbn_t::)~0.01) DisjunctionMaxQuery((isbn_t:roosevelt |
>>>> title1_t:roosevelt)~0.01))~3) DisjunctionMaxQuery((title2_unstem:"churchill
>>>> roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil
>>>> roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 |
>>>> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil
>>>> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 |
>>>> author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill
>>>> roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 |
>>>> other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill
>>>> roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 |
>>>> title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill
>>>> roosevelt"~3^80.0)~0.01)
>>>> </str>
>>>> <strname="parsedquery_toString">
>>>> +(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01
>>>> (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) (title2_unstem:"churchill
>>>> roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil
>>>> roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 |
>>>> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil
>>>> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 |
>>>> author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill
>>>> roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 |
>>>> other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill
>>>> roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 |
>>>> title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill
>>>> roosevelt"~3^80.0)~0.01
>>>> </str>
>>>> <lstname="explain"/>
>>>> <strname="QParser">
>>>> DisMaxQParser
>>>> </str>
>>>> <nullname="altquerystring"/>
>>>> <nullname="boostfuncs"/>
>>>> <lstname="timing">
>>>> <doublename="time">
>>>> 6.0
>>>> </double>
>>>> <lstname="prepare">
>>>> <doublename="time">
>>>> 3.0
>>>> </double>
>>>> <lstname="org.apache.solr.handler.component.QueryComponent">
>>>> <doublename="time">
>>>> 2.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.FacetComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.MoreLikeThisComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.HighlightComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.StatsComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.SpellCheckComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.DebugComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> </lst>
>>>>
>>>>
>>>>

Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Erick Erickson <er...@gmail.com>.
Jonathan:

Thanks for writing that up, you're right, it is arcane....

I've starred this one!

Erick

> [quoted text trimmed; the quoted messages appear in full below]

Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Okay, I figured this one out -- I'm participating in a thread with 
myself here, but for the benefit of posterity, or in case anyone's 
interested, it's kind of interesting.

It's actually a variation of the known issue with dismax, mm, and fields 
with varying stopwords -- a pretty tricky problem with dismax, which it's 
now clear goes way beyond just stopwords.

http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/

So to understand, first familiarize yourself with that.

However, none of the fields involved here had any stopwords at all, so 
at first it wasn't obvious this was the problem. But having different 
tokenization and other analysis between fields can result in exactly the 
same problem, for certain queries.

One field in the dismax qf used an analyzer that stripped punctuation. 
(I'm actually not positive at this point _which_ analyzer in my chain 
was stripping punctuation, I'm using a bunch including some custom ones, 
but I was aware that punctuation was being stripped, this was intentional.)

So "monkey's" turns into "monkey".  "monkey:" turns into "monkey".  So 
far so good. But what happens if you have punctuation all by itself, 
separated by whitespace?  "Roosevelt & Churchill" turns into 
['roosevelt', 'churchill'].  That ampersand in the middle was stripped 
out, essentially _just as if_ it were a stopword. Only two tokens result 
from that input.

You can see where this is going -- another field involved in the dismax 
qf did NOT strip out punctuation. So three tokens result from that 
input, ['Roosevelt', '&', 'Churchill'].

Now we have exactly the situation that gives rise to the dismax stopwords 
mm-behaving-funny situation; it's exactly the same thing.
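
To make the mechanism concrete, here's a rough model of it in Python -- 
my own sketch, not Solr code, and the two toy analyzers are assumptions 
standing in for the real analysis chains. DisMaxQParser splits the query 
on whitespace, analyzes each piece against every qf field, and only the 
positions where at least one field yields a token become clauses that mm 
counts:

```python
import re

# Toy analyzers standing in for the field types discussed (assumptions,
# not the actual analysis chains from this schema):
def text_analyzer(tok):
    """Strips punctuation, like the title1_t chain: ':' yields no token."""
    t = re.sub(r"[^\w]", "", tok)
    return [t.lower()] if t else []

def keyword_analyzer(tok):
    """Keeps the raw token, like an isbn_t / KeywordTokenizer field."""
    return [tok.lower()]

def dismax_clauses(query, analyzers):
    """One DisjunctionMaxQuery clause per whitespace piece that survives
    analysis in at least one qf field; mm counts these clauses."""
    clauses = []
    for tok in query.split():
        alternatives = [t for analyze in analyzers for t in analyze(tok)]
        if alternatives:        # empty in *every* field -> no clause at all
            clauses.append(alternatives)
    return clauses

# qf=title1_t only: ':' vanishes everywhere, so mm=100% means "2 of 2"
print(len(dismax_clauses("churchill : roosevelt", [text_analyzer])))   # 2
# qf=title1_t isbn_t: ':' survives in one field, so mm=100% means "3 of 3"
print(len(dismax_clauses("churchill : roosevelt",
                         [text_analyzer, keyword_analyzer])))          # 3
```

With three clauses and mm=100%, the (isbn_t::)~0.01 clause from the 
debug trace becomes a required clause that no document can satisfy, 
which is why the 25 title1_t hits disappear.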

Now I've fixed this for punctuation just by making those fields strip 
out punctuation, by adding these filters to the end of those 
previously-not-stripping-punctuation field definitions:

<!-- strip punctuation, to avoid the dismax stopwords-like mm bug -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="([\p{Punct}])" replacement="" replace="all"
/>
<!-- if after stripping punctuation we have any 0-length tokens, make
     sure to eliminate them. We can use LengthFilter min=1 for that;
     we don't care about the max here, just a very large number. -->
<filter class="solr.LengthFilterFactory" min="1" max="100"/>
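
In case it helps anyone wiring this in, here's roughly where those 
filters would sit in a schema.xml fieldType. The tokenizer and lowercase 
filter here are placeholders for illustration, not the actual chain from 
my index:

```xml
<fieldType name="text_nopunct" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- strip punctuation so it tokenizes away consistently across qf fields -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([\p{Punct}])" replacement="" replace="all"/>
    <!-- drop any tokens the replacement emptied out -->
    <filter class="solr.LengthFilterFactory" min="1" max="100"/>
  </analyzer>
</fieldType>
```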


And things are working how I expect again, at least for this 
punctuation issue. But there may be other edge cases where differences 
in analysis result in a different number of tokens from different 
fields, which, if the fields are both included in a dismax qf, will have 
bad effects on 'mm'.

The lesson, I think, is that the only absolutely safe way to use dismax 
'mm' is when all fields in the 'qf' have exactly the same analysis.  
But obviously that's not very practical; it destroys much of the power 
of dismax. And some differences in analysis are certainly acceptable -- 
but it's rather tricky to figure out whether your differences in 
analysis are going to be significant for this problem, under what input, 
and if so to fix them. It is not an easy thing to do.  So dismax 
definitely has this gotcha potentially waiting for you whenever you mix 
fields with different analysis in a 'qf'.
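
One way to catch this ahead of time, at least approximately: run a 
battery of representative queries through a stand-in for each qf field's 
analysis and flag any position where the fields disagree about whether a 
token survives. A rough sketch in Python -- the toy analyzers are my own 
assumptions, not the real chains:

```python
import re

# Toy stand-ins for the field analysis chains (assumptions, not my schema):
def stripping_analyzer(tok):            # strips punctuation, like title1_t
    t = re.sub(r"[^\w]", "", tok)
    return [t.lower()] if t else []

def keyword_analyzer(tok):              # keeps everything, like isbn_t
    return [tok.lower()]

def mm_hazards(queries, analyzers):
    """Flag whitespace-split positions where some qf fields drop the token
    and others keep it -- exactly the condition that skews dismax mm."""
    bad = []
    for q in queries:
        for tok in q.split():
            counts = {name: len(an(tok)) for name, an in analyzers.items()}
            if 0 in counts.values() and any(counts.values()):
                bad.append((q, tok))
    return bad

fields = {"title1_t": stripping_analyzer, "isbn_t": keyword_analyzer}
print(mm_hazards(["churchill : roosevelt", "plain query"], fields))
# [('churchill : roosevelt', ':')]
```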


On 6/14/2011 5:25 PM, Jonathan Rochkind wrote:
> [quoted text trimmed; the quoted messages appear in full below]

Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Okay, let's try the debug trace again without a pf to be less confusing.

One field in qf, an ordinarily tokenized text field, and it does get hits:

q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=

<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((title1_t:churchil)~0.01) 
DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
</str>

And that gets 25 hits. Now we add a second field to the qf; this 
second field is also ordinarily tokenized. We expect no _fewer_ than 25 
hits when adding another field into qf, right? And indeed it still 
results in exactly 25 hits (no additional hits from the additional qf 
field).

?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=

<str name="parsedquery">
+((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01) 
DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt | 
title1_t:roosevelt)~0.01)~2) ()
</str>



Okay, now we go back to just that first (ordinarily tokenized) field, 
but add a second field that uses KeywordTokenizerFactory.  We don't 
necessarily expect this to ever match for a multi-word query, but we 
don't expect fewer than 25 hits; the 25 hits from the first field in 
the qf should still be there, right? But they're not. What happened, why not?

q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=


<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill | 
title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01) 
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) 
()</str>
<str name="parsedquery_toString">+(((isbn_t:churchill | 
title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt | 
title1_t:roosevelt)~0.01)~3) ()</str>



On 6/14/2011 5:19 PM, Jonathan Rochkind wrote:
> I'm aware that using a field tokenized with KeywordTokenizerFactory 
> in a dismax 'qf' is often going to result in 0 hits on that field 
> (when a whitespace-containing query is entered).  But I do it anyway, 
> for cases where a non-whitespace-containing query is entered; then it 
> hits.  And in those cases where it doesn't hit, I figure okay, well, 
> the other fields in qf will hit or not, that's good enough.
>
> And usually that works. But it works _differently_ when my query 
> contains an ampersand (or any other punctuation), resulting in 0 hits 
> when it shouldn't, and I can't figure out why.
>
> basically,
>
> &defType=dismax&mm=100%&q=one : two&qf=text_field
>
> gets hits.  The ":" is thrown out of the text_field, but the mm still 
> passes somehow, right?
>
> But, in the same index:
>
> &defType=dismax&mm=100%&q=one : two&qf=text_field 
> keyword_tokenized_text_field
>
> gets 0 hits.  Somehow maybe the inclusion of the 
> keyword_tokenized_text_field in the qf causes dismax to calculate the 
> mm differently, decide there are three tokens in there and they all 
> must match, and the token ":" can never match because it's not in my 
> index (it's stripped out)... but somehow this isn't a problem unless I 
> include a keyword-tokenized field in the qf?
>
> This is really confusing; if anyone has any idea what I'm talking 
> about and can shed any light on it, much appreciated.
>
> The conclusion I am reaching is just NEVER include anything but a more 
> or less ordinarily tokenized field in a dismax qf. Sadly, it was 
> useful for certain use cases for me.
>
> Oh, hey, the debugging trace would probably be useful:
>
>
> <lstname="debug">
> <strname="rawquerystring">
> churchill : roosevelt
> </str>
> <strname="querystring">
> churchill : roosevelt
> </str>
> <strname="parsedquery">
> +((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01) 
> DisjunctionMaxQuery((isbn_t::)~0.01) 
> DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) 
> DisjunctionMaxQuery((title2_unstem:"churchill roosevelt"~3^240.0 | 
> text:"churchil roosevelt"~3^10.0 | title2_t:"churchil 
> roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 | 
> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil 
> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 | 
> author2_unstem:"churchill roosevelt"~3^240.0 | 
> title3_unstem:"churchill roosevelt"~3^80.0 | subject_t:"churchil 
> roosevelt"~3^10.0 | other_number_unstem:"churchill roosevelt"~3^40.0 | 
> subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil 
> roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 | 
> text_unstem:"churchill roosevelt"~3^80.0)~0.01)
> </str>
> <strname="parsedquery_toString">
> +(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01 
> (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) 
> (title2_unstem:"churchill roosevelt"~3^240.0 | text:"churchil 
> roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 | 
> author_unstem:"churchill roosevelt"~3^400.0 | 
> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil 
> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 | 
> author2_unstem:"churchill roosevelt"~3^240.0 | 
> title3_unstem:"churchill roosevelt"~3^80.0 | subject_t:"churchil 
> roosevelt"~3^10.0 | other_number_unstem:"churchill roosevelt"~3^40.0 | 
> subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil 
> roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 | 
> text_unstem:"churchill roosevelt"~3^80.0)~0.01
> </str>
> <lstname="explain"/>
> <strname="QParser">
> DisMaxQParser
> </str>
> <nullname="altquerystring"/>
> <nullname="boostfuncs"/>
> <lstname="timing">
> <doublename="time">
> 6.0
> </double>
> <lstname="prepare">
> <doublename="time">
> 3.0
> </double>
> <lstname="org.apache.solr.handler.component.QueryComponent">
> <doublename="time">
> 2.0
> </double>
> </lst>
> <lstname="org.apache.solr.handler.component.FacetComponent">
> <doublename="time">
> 0.0
> </double>
> </lst>
> <lstname="org.apache.solr.handler.component.MoreLikeThisComponent">
> <doublename="time">
> 0.0
> </double>
> </lst>
> <lstname="org.apache.solr.handler.component.HighlightComponent">
> <doublename="time">
> 0.0
> </double>
> </lst>
> <lstname="org.apache.solr.handler.component.StatsComponent">
> <doublename="time">
> 0.0
> </double>
> </lst>
> <lstname="org.apache.solr.handler.component.SpellCheckComponent">
> <doublename="time">
> 0.0
> </double>
> </lst>
> <lstname="org.apache.solr.handler.component.DebugComponent">
> <doublename="time">
> 0.0
> </double>
> </lst>
> </lst>
>
>
>