Posted to solr-user@lucene.apache.org by Jonathan Rochkind <ro...@jhu.edu> on 2011/06/14 23:19:10 UTC
ampersand, dismax, combining two fields, one of which is keywordTokenizer
I'm aware that using a field tokenized with KeywordTokenizerFactory
in a dismax 'qf' is often going to result in 0 hits on that field --
(when a whitespace-containing query is entered). But I do it anyway,
for cases where a non-whitespace-containing query is entered, then it
hits. And in those cases where it doesn't hit, I figure okay, well, the
other fields in qf will hit or not, that's good enough.
And usually that works. But it works _differently_ when my query
contains an ampersand (or any other punctuation), resulting in 0 hits when
it shouldn't, and I can't figure out why.
basically,
&defType=dismax&mm=100%&q=one : two&qf=text_field
gets hits. The ":" is thrown out of the text_field, but the mm still
passes somehow, right?
But, in the same index:
&defType=dismax&mm=100%&q=one : two&qf=text_field
keyword_tokenized_text_field
gets 0 hits. Somehow maybe the inclusion of the
keyword_tokenized_text_field in the qf causes dismax to calculate the mm
differently, decide there are three tokens in there and they all must
match, and the token ":" can never match because it's not in my index,
it's stripped out... but somehow this isn't a problem unless I include a
keyword-tokenized field in the qf?
This is really confusing; if anyone has any idea what I'm talking about
and can shed any light on it, much appreciated.
The conclusion I am reaching is just NEVER include anything but a more
or less ordinarily tokenized field in a dismax qf. Sadly, it was useful
for certain use cases for me.
Oh, hey, the debugging trace would probably be useful:
<lst name="debug">
<str name="rawquerystring">
churchill : roosevelt
</str>
<str name="querystring">
churchill : roosevelt
</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01)
DisjunctionMaxQuery((isbn_t::)~0.01)
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3)
DisjunctionMaxQuery((title2_unstem:"churchill roosevelt"~3^240.0 |
text:"churchil roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0
| author_unstem:"churchill roosevelt"~3^400.0 |
title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil
roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 |
author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill
roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 |
other_number_unstem:"churchill roosevelt"~3^40.0 |
subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil
roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 |
text_unstem:"churchill roosevelt"~3^80.0)~0.01)
</str>
<str name="parsedquery_toString">
+(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01
(isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3)
(title2_unstem:"churchill roosevelt"~3^240.0 | text:"churchil
roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 |
author_unstem:"churchill roosevelt"~3^400.0 | title_exactmatch:churchill
roosevelt^500.0 | title1_t:"churchil roosevelt"~3^60.0 |
title1_unstem:"churchill roosevelt"~3^320.0 | author2_unstem:"churchill
roosevelt"~3^240.0 | title3_unstem:"churchill roosevelt"~3^80.0 |
subject_t:"churchil roosevelt"~3^10.0 | other_number_unstem:"churchill
roosevelt"~3^40.0 | subject_unstem:"churchill roosevelt"~3^80.0 |
title_series_t:"churchil roosevelt"~3^40.0 |
title_series_unstem:"churchill roosevelt"~3^60.0 |
text_unstem:"churchill roosevelt"~3^80.0)~0.01
</str>
<lst name="explain"/>
<str name="QParser">
DisMaxQParser
</str>
<null name="altquerystring"/>
<null name="boostfuncs"/>
<lst name="timing">
<double name="time">
6.0
</double>
<lst name="prepare">
<double name="time">
3.0
</double>
<lst name="org.apache.solr.handler.component.QueryComponent">
<double name="time">
2.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.FacetComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.HighlightComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.StatsComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.SpellCheckComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.DebugComponent">
<double name="time">
0.0
</double>
</lst>
</lst>
Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer
Posted by Chris Hostetter <ho...@fucit.org>.
: Maybe what I really need is a query parser that does not do "disjunction
: maximum" at all, but somehow still combines different 'qf' type fields with
: different boosts on each field. I personally don't _necessarily_ need the
: actual "disjunction max" calculation, but I do need combining of multiple
: fields with different boosts. Of course, I'm not sure exactly how it would
: combine multiple fields if not "disjunction maximum", but perhaps one is
: conceivable that wouldn't be subject to this particular gotcha with differing
: analysis.
you can sort of do that today, something like this should work...
q = _query_:"$q1"^100 _query_:"$q2"^10 _query_:"$q3"^5 _query_:"$q4"
q1 = {!lucene df=title v=$qq}
q2 = {!lucene df=summary v=$qq}
q3 = {!lucene df=author v=$qq}
q4 = {!lucene df=body v=$qq}
qq = ...user input here...
..but you might want to replace "lucene" with "field" depending on what
metacharacters you want to support.
in general though the reason i wrote the dismax parser (instead of a
parser that works like this) is because of how multiword queries wind up
matching/scoring. A guy named Chuck Williams wrote the earliest
version of the DisjunctionMaxQuery class and his "albino elephant"
example totally sold me on this approach back in 2005...
http://www.lucidimagination.com/search/document/8ce795c4b6752a1f/contribution_better_multi_field_searching
https://issues.apache.org/jira/browse/LUCENE-323
: I also remain kind of confused about how the existing dismax figures out "how
: many terms" for the 'mm' type calculations. If someone wanted to explain that,
: I would find it enlightening and helpful for understanding what's going on.
it's not really about terms -- it's just the total number of clauses in
the outer BooleanQuery that it builds. if a chunk of input produces a
valid DisjunctionMaxQuery (because the analyzer for at least one qf field
generated tokens) then that's a clause; if a chunk of input doesn't
produce a token (because none of the analyzers for any of the qf fields
generated tokens) then that's not a clause.
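[Editorial aside: that counting rule can be sketched in a few lines of Python. The analyzers here are made up purely to illustrate the point -- this is not Solr's actual code.]

```python
import re

def analyze(field, chunk):
    # Toy analyzers: "simple" strips punctuation, "complex" keeps it.
    if field == "simple":
        chunk = re.sub(r"[^\w\s]", "", chunk)
    return chunk.split()

def count_clauses(query, qf):
    # A chunk of input becomes a clause iff at least one qf field's
    # analyzer produces a token for it.
    return sum(
        1 for chunk in query.split()
        if any(analyze(field, chunk) for field in qf)
    )

# ":" yields no token in the punctuation-stripping field, so only
# 2 clauses exist and mm=100% requires 2 matches:
print(count_clauses("one : two", ["simple"]))             # 2
# Add a field that keeps ":" and a third clause appears, so
# mm=100% now requires 3 matches -- including the ":" clause:
print(count_clauses("one : two", ["simple", "complex"]))  # 3
```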
-Hoss
Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer
Posted by Jonathan Rochkind <ro...@jhu.edu>.
Yeah, I see your points. It's complicated. I'm not sure either.
But the thing is:
> in order to use a feature like that you'd have to really think hard
about
> the query analysis of your fields, and which ones will produce which
> tokens in which situations
You need to think really hard about the (index and query) analysis of
your fields and which ones will produce which tokens _now_, if you are
using multiple fields in a 'qf' with differing analysis, and using a
percent mm. (Or similarly an mm that varies depending on how many terms).
That's what I've come to realize, that's the status quo. If your qf
fields don't all have identical analysis, right _now_ you need to think
really hard about the analysis and how it's going to possibly affect
'mm', including for edge case queries. If you don't, you likely have
edge case queries (at least) which aren't behaving how you expected
(whether you notice or have it brought to your attention by users or not).
Or you can just make sure all fields in your qf have identical analysis,
and then you don't have to worry about it. But that's not always
practical, a lot of the power of dismax qf ends up being combining
fields with different analysis.
So I was trying to think of a way to make this less so, but still be
able to take advantage of dismax, but I think you're right that maybe
there isn't any, or at least nothing we've come up with yet.
Maybe what I really need is a query parser that does not do "disjunction
maximum" at all, but somehow still combines different 'qf' type fields
with different boosts on each field. I personally don't _necessarily_
need the actual "disjunction max" calculation, but I do need combining
of multiple fields with different boosts. Of course, I'm not sure exactly
how it would combine multiple fields if not "disjunction maximum", but
perhaps one is conceivable that wouldn't be subject to this particular
gotcha with differing analysis.
I also remain kind of confused about how the existing dismax figures out
"how many terms" for the 'mm' type calculations. If someone wanted to
explain that, I would find it enlightening and helpful for
understanding what's going on.
Jonathan
On 6/21/2011 10:20 PM, Chris Hostetter wrote:
> : not other) setups/intentions. It's counter-intuitive to me that adding
> : a field to the 'qf' set results in _fewer_ hits than the same 'qf' set
>
> agreed .. but that's where looking at the debug info comes in: the reason
> for that behavior is that your old qf treated part of your input as
> garbage, and the new field respects it and uses it in the
> calculation.
>
> mind you: the "fewer hits" behavior only happens when using a percentage
> value in mm ... if you had mm=2 you'd get more results, but you've asked
> for "66%" (or whatever) and with that new qf there is a different number
> of clauses produced by query parsing.
>
> : I wonder if it would be a good idea to have a parameter to (e)dismax
> : that told it which of these two behaviors to use? The one where the
> : 'term count' is based on the maximum number of terms from any field in
> : the 'qf', and one where it's based on the minimum number of terms
> : produced from any field in the qf? I am still not sure how feasible
>
> even in your use case, i don't think you are fully considering what that
> would produce. imagine that an mmType=min param existed and gave you what
> you're asking for. Now imagine that you have two fields, one named
> "simple" that strips all punctuation and one named "complex" that doesn't,
> and you have a query like this...
>
> q=Foo & Bar
> qf=simple complex
> mm=100%
> mmType=min
>
> * Foo produces tokens for all qf
> * & only produces tokens for some qf (complex)
> * Bar produces tokens for all qf
>
> your mmType would say "there are only 2 tokens that we can query across
> all fields, so our computed minShouldMatch should be 100% of 2 == 2"
>
> sounds good so far right?
>
> the problem is you still have a query clause coming from that "&"
> character ... you have 3 real clauses, one of which is that term query for
> "complex:&" which means that with your (computed) minShouldMatch of 2 you
> would see matches for any doc that happened to have indexed the "&" symbol
> in the "complex" field and also matched *either* of Foo or Bar (in either
> field)
>
> So while a lot of your results would match both Foo and Bar, you'd
> still get a bunch of weird results.
>
> : Or maybe a feature where you tell dismax, the number of tokens produced
> : by field X, THAT's the one you should use for your 'term count' for mm,
>
> Hmmm.... maybe. i'd have to see a patch in action and play with it, to
> really think it through ... hmmm ... honestly i really can't imagine how
> that would be helpful in general...
>
> in order to use a feature like that you'd have to really think hard about
> the query analysis of your fields, and which ones will produce which
> tokens in which situations in order to make sure you pick the *right*
> value for that param -- but once you've done that hard thinking you might
> as well feed it back into your schema.xml and say "the query analyzer for
> field 'complex' should prune any tokens that only contain punctuation"
> (instead of saying "'complex' will produce tokens that only contain
> punctuation, so let's tell dismax to compute mm based only on 'simple').
> After all, there might not be one single field that you can pick -- maybe
> 'complex' lets tokens that are all punctuation through but strips
> stopwords, and maybe 'simple' does the opposite ... no param value you
> pick will help you with that possibility, you really just need to fix the
> query analyzers to make sense if you want to use both of those two fields
> in the qf.
>
>
> -Hoss
>
RE: ampersand, dismax, combining two fields, one of which is keywordTokenizer
Posted by Chris Hostetter <ho...@fucit.org>.
: not other) setups/intentions. It's counter-intuitive to me that adding
: a field to the 'qf' set results in _fewer_ hits than the same 'qf' set
agreed .. but that's where looking at the debug info comes in: the reason
for that behavior is that your old qf treated part of your input as
garbage, and the new field respects it and uses it in the
calculation.
mind you: the "fewer hits" behavior only happens when using a percentage
value in mm ... if you had mm=2 you'd get more results, but you've asked
for "66%" (or whatever) and with that new qf there is a different number
of clauses produced by query parsing.
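[Editorial aside: the percentage arithmetic itself is simple -- a positive percentage mm is that fraction of the clause count, rounded down. A toy sketch, not Solr's actual code:]

```python
def effective_mm(percent, num_clauses):
    # A positive percentage mm is computed as that fraction of the
    # total clause count, rounded down to a whole number of clauses.
    return (num_clauses * percent) // 100

# With 2 clauses, mm=100% requires both to match:
print(effective_mm(100, 2))  # 2
# A qf field that tokenizes "&" produces a 3rd clause, so mm=100%
# now requires all 3 -- including the unmatchable one:
print(effective_mm(100, 3))  # 3
# A fixed mm=2 would have been unaffected by the extra clause.
```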
: I wonder if it would be a good idea to have a parameter to (e)dismax
: that told it which of these two behaviors to use? The one where the
: 'term count' is based on the maximum number of terms from any field in
: the 'qf', and one where it's based on the minimum number of terms
: produced from any field in the qf? I am still not sure how feasible
even in your use case, i don't think you are fully considering what that
would produce. imagine that an mmType=min param existed and gave you what
you're asking for. Now imagine that you have two fields, one named
"simple" that strips all punctuation and one named "complex" that doesn't,
and you have a query like this...
q=Foo & Bar
qf=simple complex
mm=100%
mmType=min
* Foo produces tokens for all qf
* & only produces tokens for some qf (complex)
* Bar produces tokens for all qf
your mmType would say "there are only 2 tokens that we can query across
all fields, so our computed minShouldMatch should be 100% of 2 == 2"
sounds good so far right?
the problem is you still have a query clause coming from that "&"
character ... you have 3 real clauses, one of which is that term query for
"complex:&" which means that with your (computed) minShouldMatch of 2 you
would see matches for any doc that happened to have indexed the "&" symbol
in the "complex" field and also matched *either* of Foo or Bar (in either
field)
So while a lot of your results would match both Foo and Bar, you'd
still get a bunch of weird results.
: Or maybe a feature where you tell dismax, the number of tokens produced
: by field X, THAT's the one you should use for your 'term count' for mm,
Hmmm.... maybe. i'd have to see a patch in action and play with it, to
really think it through ... hmmm ... honestly i really can't imagine how
that would be helpful in general...
in order to use a feature like that you'd have to really think hard about
the query analysis of your fields, and which ones will produce which
tokens in which situations in order to make sure you pick the *right*
value for that param -- but once you've done that hard thinking you might
as well feed it back into your schema.xml and say "the query analyzer for
field 'complex' should prune any tokens that only contain punctuation"
(instead of saying "'complex' will produce tokens that only contain
punctuation, so let's tell dismax to compute mm based only on 'simple').
After all, there might not be one single field that you can pick -- maybe
'complex' lets tokens that are all punctuation through but strips
stopwords, and maybe 'simple' does the opposite ... no param value you
pick will help you with that possibility, you really just need to fix the
query analyzers to make sense if you want to use both of those two fields
in the qf.
-Hoss
RE: ampersand, dismax, combining two fields, one of which is keywordTokenizer
Posted by Jonathan Rochkind <ro...@jhu.edu>.
Thanks, that's helpful.
It still seems like current behavior does the "wrong" thing in _many_ cases (I know a lot of people get tripped up by it, sometimes on this list) -- but I understand your cases where it does the right thing, and where what I'm suggesting would be the wrong thing.
> Ultimately the problem you had with "&" is the same problem people have
> with stopwords, and comes down to the same thing: if you don't want some
> chunk of text to be "significant" when searching a field in your qf, have
> your analyzer remove it
Ah, but see the problem people have with stopwords is when they actually DID that. They didn't want a term to be 'significant' in one field, but they DID want it to be 'significant' in another field... but how this affects the 'mm' ends up being kind of counter-intuitive for some (but not other) setups/intentions. It's counter-intuitive to me that adding a field to the 'qf' set results in _fewer_ hits than the same 'qf' set without the new field -- although I understand your cases where you added the field to the 'qf' precisely in order to intentionally get that behavior, that's definitely not a universal case.
And the fact that unpredictable changes to field analysis that aren't as simple as stopwords can lead to this same problem (as in this case where one field ignores punctuation and the other doesn't) -- it's definitely a trap waiting for some people.
I wonder if it would be a good idea to have a parameter to (e)dismax that told it which of these two behaviors to use? The one where the 'term count' is based on the maximum number of terms from any field in the 'qf', and one where it's based on the minimum number of terms produced from any field in the qf? I am still not sure how feasible THAT is, but it seems like a good idea to me. The current behavior is definitely a pitfall for many people.
Or maybe a feature where you tell dismax, the number of tokens produced by field X, THAT's the one you should use for your 'term count' for mm, all the other fields are really just in there as sort of supplementary -- for boosting, or for bringing a few more results in; but NOT the case where you intentionally add a 'qf' with KeepWordsFilter in order to intentionally _reduce_ the result set. I think that's a pretty common use case too.
Jonathan
Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer
Posted by Chris Hostetter <ho...@fucit.org>.
: It seems like the problem is when different fields in the 'qf' produce a
: different number of tokens for a given query. dismax needs to know the number
: of tokens in the input in order to calculate 'mm', when 'mm' is expressed as a
: percentage, or when different mm's are given for different numbers of input
: tokens.
actually the fundamental problem is that when this situation arises,
dismax has no way of knowing *if* you want the token that only produced a
TermQuery in fieldA but not fieldB to be counted at all.
In your case, you don't want the "&" query against your simple (non
whitespace stripping) field to count in computing minShouldMatch, but how
does dismax know that?
if someone has a field that not only strips out punctuation, but also
ignores anything that doesn't match one of my known keywords (using the
KeepWordsFilter) they would want the exact opposite situation from you -- they
are really counting on the cases where a token produces a valid query for
that special field to be a factor, and don't want the number of clauses used
to compute minShouldMatch to be lowered artificially just because all the
other tokens in the input don't produce anything for that field.
bottom line: as long as one field produces a token for a chunk of input,
that's a clause -- it may only be a clause that's queried against one
field, but it's still a clause.
: So what if dismax could recognize that different fields were producing
: different arity of input, and use the _smallest_ number for its 'mm'
: calculations, instead of current behavior where it's effectively the largest
: number? (Or '1' if the smallest number is '0'?!) That would in some cases
: produce errors in the other direction -- more hits coming back than you
: naively/intuitively expect. Not sure if that would be worse or better. Seems
: better to me, less bad failure mode.
consider my previous example, and something similar to Jira searching
where you might have a "projectCode" field with a query time
KeepWordsFilter that only matches project codes ... right now, a query
like q=SOLR+foo+bar+baz&mm=100%&qf=projectCode^100+text would give you
some really nice results that match all the input, but if SOLR is a
projectCode those issues bubble to the top -- with your proposal, the
effective mm would be "1" (because the projectCode field would only wind
up with the SOLR clause) and you'd get all sorts of crap -- because those
other clauses are all still there. so you'd get *all* projectCode:SOLR
issues, and *all* issues matching text:foo, and *all* issues matching
text:bar etc...
: Or better yet, but surely harder perhaps infeasible to code, it would somehow
: apply the 'mm' differently to each field. Not even sure what that means
That's pretty much impossible. the whole nature of the dismax style
parser is that a DisjunctionMaxQuery is computed for each "word" of the
q, across all "fields" in the qf -- it's those DisjunctionMaxQueries that
are wrapped in a BooleanQuery with minShouldMatch set on it...
http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/
...if you "flipped" that matrix along the diagonal to have a different mm
per field, you'd lose the value of the field specific boosts.
Ultimately the problem you had with "&" is the same problem people have
with stopwords, and comes down to the same thing: if you don't want some
chunk of text to be "significant" when searching a field in your qf, have
your analyzer remove it -- if the analyzer for a field in the qf produces
a token, dismax assumes it's significant to the query and factors it into
the mm and matching and scoring.
-Hoss
Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer
Posted by Jonathan Rochkind <ro...@jhu.edu>.
Thanks. I'm trying to think through if there's any hypothetical way for
dismax to be improved to not be subject to this problem. Now that it's
clear that the problem isn't just with stopwords, and that in fact it's
very hard to predict if you'll get the problem and under what input,
when creating your schema and 'qf' list.... it seems a worse problem
than it did when it was thought of as just stopwords-related.
Of course, I'm trying to think through this without actually
understanding the dismax code at all, just based on what I know of how
dismax works from black box observation.
It seems like the problem is when different fields in the 'qf' produce a
different number of tokens for a given query. dismax needs to know the
number of tokens in the input in order to calculate 'mm', when 'mm' is
expressed as a percentage, or when different mm's are given for
different numbers of input tokens.
Somehow dismax gets at this number now, based on the actual field
analysis, not just whitespace-splitting at the query parser level.
Because if I issue query "roosevelt & churchill", and ALL the fields
involved have analysis that turns this into just two tokens
['roosevelt', 'churchill'], then dismax does the right thing,
recognizing two terms in the input. The problem is when some of the
fields produce two tokens from that input, and others produce three ---
dismax, I think, then decides there are three terms in input, but in at
least some fields those 'three' terms can't possibly all match.
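[Editorial aside: the mismatch is easy to reproduce with two toy analyzers -- hypothetical ones, just to show the differing token counts, not the actual field types from the thread:]

```python
import re

def stripping_tokens(text):
    # Analyzer A: whitespace-split, then drop punctuation-only tokens
    # (roughly what a punctuation-stripping chain does).
    tokens = (re.sub(r"\W", "", t) for t in text.split())
    return [t for t in tokens if t]

def plain_tokens(text):
    # Analyzer B: whitespace-split only, punctuation kept.
    return text.split()

q = "roosevelt & churchill"
print(stripping_tokens(q))  # ['roosevelt', 'churchill'] -- 2 tokens
print(plain_tokens(q))      # ['roosevelt', '&', 'churchill'] -- 3 tokens
# With both fields in qf, dismax counts 3 clauses, and the "&" clause
# can never match in the stripping field.
```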
So what if dismax could recognize that different fields were producing
different arity of input, and use the _smallest_ number for its 'mm'
calculations, instead of current behavior where it's effectively the
largest number? (Or '1' if the smallest number is '0'?!) That would in
some cases produce errors in the other direction -- more hits coming
back than you naively/intuitively expect. Not sure if that would be
worse or better. Seems better to me, less bad failure mode.
Or better yet, but surely harder perhaps infeasible to code, it would
somehow apply the 'mm' differently to each field. Not even sure what
that means exactly. But somehow an mm of 100% means two terms in the
field that analyzes to 2 OR three terms in the field that analyzes to
3... man, that's a mess. Okay, stick with the first idea.
But I've got no idea how feasible that is to code, and I personally have
no time to figure out how to code it, and nobody else is likely to since
this problem is unlikely to be a high priority for solr committers....
so, I dunno.
On 6/15/2011 3:46 PM, Erick Erickson wrote:
> Jonathan:
>
> Thanks for writing that up, you're right, it is arcane....
>
> I've starred this one!
>
> Erick
>
>> http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
>> http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
>>
>> So to understand, first familiarize yourself with that.
>>
>> However, none of the fields involved here had any stopwords at all, so at
>> first it wasn't obvious this was the problem. But having different
>> tokenization and other analysis between fields can result in exactly the
>> same problem, for certain queries.
>>
>> One field in the dismax qf used an analyzer that stripped punctuation. (I'm
>> actually not positive at this point _which_ analyzer in my chain was
>> stripping punctuation, I'm using a bunch including some custom ones, but I
>> was aware that punctuation was being stripped, this was intentional.)
>>
>> So "monkey's" turns into "monkey". "monkey:" turns into "monkey". So far
>> so good. But what happens if you have punctuation all by itself separated by
>> whitespace? "Roosevelt & Churchill" turns into ['roosevelt', 'churchill'].
>> That ampersand in the middle was stripped out, essentially _just as if_ it
>> were a stopword. Only two tokens result from that input.
>>
>> You can see where this is going -- another field involved in the dismax qf
>> did NOT strip out punctuation. So three tokens result from that input,
>> ['Roosevelt', '&', 'Churchill'].
>>
>> Now we have exactly the situation that gives rise to the dismax stopwords
>> mm-behaving-funny situation; it's exactly the same thing.
>>
>> Now I've fixed this for punctuation just by making those fields strip out
>> punctuation, by adding these analyzers to the bottom of those
>> previously-not-stripping-punctuation field definitions:
>>
>> <!-- strip punctuation, to avoid dismax stopwords-like mm bug -->
>> <filter class="solr.PatternReplaceFilterFactory"
>> pattern="([\p{Punct}])" replacement="" replace="all"
>> />
>> <!-- if after stripping punc we have any 0-length tokens, make
>> sure to eliminate them. We can use LengthFilter min=1 for that,
>> we don't care about the max here, just a very large number. -->
>> <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>>
>>
>> And things are working how I expect again, at least for this punctuation
>> issue. But there may be other edge cases where differences in analysis
>> result in different number of tokens from different fields, which if they
>> are both included in a dismax qf, will have bad effects on 'mm'.
>>
>> The lesson I think, is that the only absolute safe way to use dismax 'mm',
>> is when all fields in the 'qf' have exactly the same analysis. But
>> obviously that's not very practical, it destroys much of the power of
>> dismax. And some differences in analysis are certainly acceptable -- but
>> it's rather tricky to figure out if your differences in analysis are going
>> to be significant for this problem, under what input, and if so fix them. It
>> is not an easy thing to do. So dismax definitely has this gotcha
>> potentially waiting for you, whenever mixing fields with different analysis
>> in a 'qf'.
>>
>>
>> On 6/14/2011 5:25 PM, Jonathan Rochkind wrote:
>>> Okay, let's try the debug trace again without a pf to be less confusing.
>>>
>>> One field in qf, that's ordinary text tokenized, and does get hits:
>>>
>>>
>>> q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=
>>>
>>> <str name="rawquerystring">churchill : roosevelt</str>
>>> <str name="querystring">churchill : roosevelt</str>
>>> <str name="parsedquery">
>>> +((DisjunctionMaxQuery((title1_t:churchil)~0.01)
>>> DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
>>> </str>
>>> <str name="parsedquery_toString">
>>> +(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
>>> </str>
>>>
>>> And that gets 25 hits. Now we add in a second field to the qf, this second
>>> field is also ordinarily tokenized. We expect no _fewer_ than 25 hits,
>>> adding another field into qf, right? And indeed it still results in exactly
>>> 25 hits (no additional hits from the additional qf field).
>>>
>>>
>>> ?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=
>>>
>>> <str name="parsedquery">
>>> +((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01)
>>> DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()
>>> </str>
>>> <str name="parsedquery_toString">
>>> +(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt |
>>> title1_t:roosevelt)~0.01)~2) ()
>>> </str>
>>>
>>>
>>>
>>> Okay, now we go back to just that first (ordinarily tokenized) field, but
>>> add a second field that uses KeywordTokenizerFactory. We expect this not
>>> necessarily to ever match for a multi-word query, but we don't expect it to
>>> be fewer than 25 hits, the 25 hits from the first field in the qf should
>>> still be there, right? But it's not. What happened, why not?
>>>
>>>
>>> q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=
>>>
>>>
>>> <str name="rawquerystring">churchill : roosevelt</str>
>>> <str name="querystring">churchill : roosevelt</str>
>>> <str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill |
>>> title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01)
>>> DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3)
>>> ()</str>
>>> <str name="parsedquery_toString">+(((isbn_t:churchill |
>>> title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt |
>>> title1_t:roosevelt)~0.01)~3) ()</str>
>>>
>>>
>>>
>>> On 6/14/2011 5:19 PM, Jonathan Rochkind wrote:
>>>> I'm aware that using a field tokenized with KeywordTokenizerFactory in
>>>> a dismax 'qf' is often going to result in 0 hits on that field -- (when a
>>>> whitespace-containing query is entered). But I do it anyway, for cases
>>>> where a non-whitespace-containing query is entered, then it hits. And in
>>>> those cases where it doesn't hit, I figure okay, well, the other fields in
>>>> qf will hit or not, that's good enough.
>>>>
>>>> And usually that works. But it works _differently_ when my query contains
>>>> an ampersand (or any other punctuation), resulting in 0 hits when it shouldn't,
>>>> and I can't figure out why.
>>>>
>>>> basically,
>>>>
>>>> &defType=dismax&mm=100%&q=one : two&qf=text_field
>>>>
>>>> gets hits. The ":" is thrown out of the text_field, but the mm still passes
>>>> somehow, right?
>>>>
>>>> But, in the same index:
>>>>
>>>> &defType=dismax&mm=100%&q=one : two&qf=text_field
>>>> keyword_tokenized_text_field
>>>>
>>>> gets 0 hits. Somehow maybe the inclusion of the
>>>> keyword_tokenized_text_field in the qf causes dismax to calculate the mm
>>>> differently, decide there are three tokens in there and they all must match,
>>>> and the token ":" can never match because it's not in my index, it's stripped
>>>> out... but somehow this isn't a problem unless I include a keyword-tokenized
>>>> field in the qf?
>>>>
>>>> This is really confusing, if anyone has any idea what I'm talking about
>>>> it and can shed any light on it, much appreciated.
>>>>
>>>> The conclusion I am reaching is just NEVER include anything but a more or
>>>> less ordinarily tokenized field in a dismax qf. Sadly, it was useful for
>>>> certain use cases for me.
>>>>
>>>> Oh, hey, the debugging trace woudl probably be useful:
>>>>
>>>>
>>>> <lstname="debug">
>>>> <strname="rawquerystring">
>>>> churchill : roosevelt
>>>> </str>
>>>> <strname="querystring">
>>>> churchill : roosevelt
>>>> </str>
>>>> <strname="parsedquery">
>>>> +((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01)
>>>> DisjunctionMaxQuery((isbn_t::)~0.01) DisjunctionMaxQuery((isbn_t:roosevelt |
>>>> title1_t:roosevelt)~0.01))~3) DisjunctionMaxQuery((title2_unstem:"churchill
>>>> roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil
>>>> roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 |
>>>> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil
>>>> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 |
>>>> author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill
>>>> roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 |
>>>> other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill
>>>> roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 |
>>>> title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill
>>>> roosevelt"~3^80.0)~0.01)
>>>> </str>
>>>> <strname="parsedquery_toString">
>>>> +(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01
>>>> (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) (title2_unstem:"churchill
>>>> roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil
>>>> roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 |
>>>> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil
>>>> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 |
>>>> author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill
>>>> roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 |
>>>> other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill
>>>> roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 |
>>>> title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill
>>>> roosevelt"~3^80.0)~0.01
>>>> </str>
>>>> <lstname="explain"/>
>>>> <strname="QParser">
>>>> DisMaxQParser
>>>> </str>
>>>> <nullname="altquerystring"/>
>>>> <nullname="boostfuncs"/>
>>>> <lstname="timing">
>>>> <doublename="time">
>>>> 6.0
>>>> </double>
>>>> <lstname="prepare">
>>>> <doublename="time">
>>>> 3.0
>>>> </double>
>>>> <lstname="org.apache.solr.handler.component.QueryComponent">
>>>> <doublename="time">
>>>> 2.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.FacetComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.MoreLikeThisComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.HighlightComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.StatsComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.SpellCheckComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> <lstname="org.apache.solr.handler.component.DebugComponent">
>>>> <doublename="time">
>>>> 0.0
>>>> </double>
>>>> </lst>
>>>> </lst>
>>>>
>>>>
>>>>
Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer
Posted by Erick Erickson <er...@gmail.com>.
Jonathan:
Thanks for writing that up, you're right, it is arcane....
I've starred this one!
Erick
Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer
Posted by Jonathan Rochkind <ro...@jhu.edu>.
Okay, I figured this one out -- I'm participating in a thread with
myself here, but for the benefit of posterity, or anyone who's interested,
it's kind of interesting.
It's actually a variation of the known issue with dismax, mm, and fields
with varying stopwords. Actually a pretty tricky problem with dismax,
which it's now clear goes way beyond just stopwords.
http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
So to understand, first familiarize yourself with that.
However, none of the fields involved here had any stopwords at all, so
at first it wasn't obvious this was the problem. But having different
tokenization and other analysis between fields can result in exactly the
same problem, for certain queries.
One field in the dismax qf used an analyzer that stripped punctuation.
(I'm actually not positive at this point _which_ analyzer in my chain
was stripping punctuation, I'm using a bunch including some custom ones,
but I was aware that punctuation was being stripped, this was intentional.)
So "monkey's" turns into "monkey". "monkey:" turns into "monkey". So
far so good. But what happens if you have punctuation all by itself
separated by whitespace? "Roosevelt & Churchill" turns into
['roosevelt', 'churchill']. That ampersand in the middle was stripped
out, essentially _just as if_ it were a stopword. Only two tokens result
from that input.
You can see where this is going -- another field involved in the dismax
qf did NOT strip out punctuation. So three tokens result from that
input, ['Roosevelt', '&', 'Churchill'].
Now we have exactly the situation that gives rise to the dismax stopwords
mm-behaving-funny situation; it's exactly the same thing.
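The mismatch is easy to reproduce outside Solr. Here's a rough Python model of the behavior (NOT Solr's actual code; the two analyzer functions are toy stand-ins for the real analysis chains):

```python
import re

def strip_punct_analyzer(text):
    # Mimics a chain that strips punctuation and drops any now-empty tokens.
    tokens = [re.sub(r"[^\w]", "", t).lower() for t in text.split()]
    return [t for t in tokens if t]

def plain_whitespace_analyzer(text):
    # Mimics a chain that keeps a bare punctuation token as-is.
    return [t.lower() for t in text.split()]

query = "Roosevelt & Churchill"
print(strip_punct_analyzer(query))       # ['roosevelt', 'churchill'] -> 2 tokens
print(plain_whitespace_analyzer(query))  # ['roosevelt', '&', 'churchill'] -> 3 tokens
```

With both fields in qf, dismax builds one clause per query-side token, so it ends up with 3 clauses, and mm=100% demands all 3 match; the '&' clause can never match a field that stripped it at index time, so every document fails.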
Now I've fixed this for punctuation just by making those fields strip
out punctuation, by adding these filters to the end of those
previously-not-stripping-punctuation field definitions:
<!-- strip punctuation, to avoid dismax stopwords-like mm bug -->
<filter class="solr.PatternReplaceFilterFactory"
pattern="([\p{Punct}])" replacement="" replace="all"
/>
<!-- if after stripping punc we have any 0-length tokens, make
sure to eliminate them. We can use LengthFilter min=1 for
that,
we don't care about the max here, just a very large
number. -->
<filter class="solr.LengthFilterFactory" min="1" max="100"/>
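In plain Python terms, roughly what those two filters do to a token stream (a sketch, not the Lucene implementation; note that \p{Punct} and Python's string.punctuation are only approximately equivalent):

```python
import re
import string

def pattern_replace(tokens, pattern=r"[%s]" % re.escape(string.punctuation)):
    # Like solr.PatternReplaceFilterFactory with replace="all": delete
    # every punctuation character inside each token.
    return [re.sub(pattern, "", t) for t in tokens]

def length_filter(tokens, minimum=1, maximum=100):
    # Like solr.LengthFilterFactory: drop tokens outside [min, max] length,
    # i.e. the now-empty token left behind by a bare "&".
    return [t for t in tokens if minimum <= len(t) <= maximum]

stream = ["Roosevelt", "&", "Churchill"]
print(length_filter(pattern_replace(stream)))  # ['Roosevelt', 'Churchill']
```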
And things are working how I expect again, at least for this
punctuation issue. But there may be other edge cases where differences
in analysis result in different number of tokens from different fields,
which if they are both included in a dismax qf, will have bad effects on
'mm'.
The lesson, I think, is that the only absolutely safe way to use dismax
'mm' is when all fields in the 'qf' have exactly the same analysis.
But obviously that's not very practical, it destroys much of the power
of dismax. And some differences in analysis are certainly acceptable --
but it's rather tricky to figure out if your differences in analysis are
going to be significant for this problem, under what input, and if so
fix them. It is not an easy thing to do. So dismax definitely has this
gotcha potentially waiting for you, whenever mixing fields with
different analysis in a 'qf'.
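One way to guard against the gotcha is to run a handful of nasty sample queries through a model of each qf field's analysis and flag any query where the fields disagree on token count. The analyzers below are toy stand-ins (against a real Solr you'd want the per-field output from the analysis request handler instead):

```python
import re

# Hypothetical field names; the lambdas only approximate real analysis chains.
analyzers = {
    "text_field": lambda q: [t for t in (re.sub(r"[^\w]", "", w) for w in q.split()) if t],
    "keyword_tokenized_text_field": lambda q: [q],  # KeywordTokenizer: whole input = one token
    "plain_field": lambda q: q.split(),             # keeps bare punctuation tokens
}

samples = ["one : two", "Roosevelt & Churchill", "monkey's"]
for q in samples:
    counts = {f: len(a(q)) for f, a in analyzers.items()}
    if len(set(counts.values())) > 1:
        # Any disagreement here is a potential dismax mm hazard.
        print("mm hazard for %r: %s" % (q, counts))
```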
On 6/14/2011 5:25 PM, Jonathan Rochkind wrote:
> [...]
Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer
Posted by Jonathan Rochkind <ro...@jhu.edu>.
Okay, let's try the debug trace again without a pf to be less confusing.
One field in qf, that's ordinary text tokenized, and does get hits:
q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=
<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((title1_t:churchil)~0.01)
DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
</str>
And that gets 25 hits. Now we add in a second field to the qf, this
second field is also ordinarily tokenized. We expect no _fewer_ than 25
hits, adding another field into qf, right? And indeed it still results
in exactly 25 hits (no additional hits from the additional qf field).
?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=
<str name="parsedquery">
+((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01)
DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt |
title1_t:roosevelt)~0.01)~2) ()
</str>
Okay, now we go back to just that first (ordinarily tokenized) field,
but add a second field that uses KeywordTokenizerFactory. We don't
necessarily expect this to ever match a multi-word query, but we don't
expect it to be fewer than 25 hits, the 25 hits from the first field in
the qf should still be there, right? But it's not. What happened, why not?
q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=
<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill |
title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01)
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3)
()</str>
<str name="parsedquery_toString">+(((isbn_t:churchill |
title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt |
title1_t:roosevelt)~0.01)~3) ()</str>
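The ~2 and ~3 at the end of those parsed queries are the BooleanQuery's minimum-number-should-match, which dismax derives from mm=100% and the clause count. A toy evaluation (not Lucene code) shows why the extra (isbn_t::) clause sinks every document:

```python
# A document is modeled as the set of (field, token) pairs it indexed; a
# DisjunctionMaxQuery clause matches if any of its field:token alternatives
# is present; the whole query matches only if at least min_should_match
# clauses match (mm=100% -> all of them).

def matches(doc, clauses, min_should_match):
    matched = sum(1 for alternatives in clauses if alternatives & doc)
    return matched >= min_should_match

doc = {("title1_t", "churchil"), ("title1_t", "roosevelt")}

# qf=title1_t alone: the ":" produced no token, so only 2 clauses, ~2.
two_clause = [{("title1_t", "churchil")}, {("title1_t", "roosevelt")}]
print(matches(doc, two_clause, 2))    # True  -> the 25 hits

# qf=title1_t isbn_t: isbn_t kept ":" as a token, so 3 clauses, ~3,
# and the (isbn_t::) clause can never match anything.
three_clause = [
    {("isbn_t", "churchill"), ("title1_t", "churchil")},
    {("isbn_t", ":")},
    {("isbn_t", "roosevelt"), ("title1_t", "roosevelt")},
]
print(matches(doc, three_clause, 3))  # False -> 0 hits
```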
On 6/14/2011 5:19 PM, Jonathan Rochkind wrote:
> I'm aware that using a field tokenized with KeywordTokenizerFactory
> in a dismax 'qf' is often going to result in 0 hits on that field --
> (when a whitespace-containing query is entered). But I do it anyway,
> for cases where a non-whitespace-containing query is entered, then it
> hits. And in those cases where it doesn't hit, I figure okay, well,
> the other fields in qf will hit or not, that's good enough.
>
> And usually that works. But it works _differently_ when my query
> contains an ampersand (or any other punctuation), resulting in 0 hits
> when it shouldn't, and I can't figure out why.
>
> basically,
>
> &defType=dismax&mm=100%&q=one : two&qf=text_field
>
> gets hits. The ":" is thrown out of the text_field, but the mm still
> passes somehow, right?
>
> But, in the same index:
>
> &defType=dismax&mm=100%&q=one : two&qf=text_field
> keyword_tokenized_text_field
>
> gets 0 hits. Somehow maybe the inclusion of the
> keyword_tokenized_text_field in the qf causes dismax to calculate the
> mm differently, decide there are three tokens in there and they all
> must match, and the token ":" can never match because it's not in my
> index -- it's stripped out... but somehow this isn't a problem unless I
> include a keyword-tokenized field in the qf?
>
> This is really confusing; if anyone has any idea what I'm talking
> about and can shed any light on it, much appreciated.
>
> The conclusion I am reaching is just NEVER include anything but a more
> or less ordinarily tokenized field in a dismax qf. Sadly, it was
> useful for certain use cases for me.
>
> Oh, hey, the debugging trace would probably be useful:
>
>
> <lst name="debug">
> <str name="rawquerystring">
> churchill : roosevelt
> </str>
> <str name="querystring">
> churchill : roosevelt
> </str>
> <str name="parsedquery">
> +((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01)
> DisjunctionMaxQuery((isbn_t::)~0.01)
> DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3)
> DisjunctionMaxQuery((title2_unstem:"churchill roosevelt"~3^240.0 |
> text:"churchil roosevelt"~3^10.0 | title2_t:"churchil
> roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 |
> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil
> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 |
> author2_unstem:"churchill roosevelt"~3^240.0 |
> title3_unstem:"churchill roosevelt"~3^80.0 | subject_t:"churchil
> roosevelt"~3^10.0 | other_number_unstem:"churchill roosevelt"~3^40.0 |
> subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil
> roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 |
> text_unstem:"churchill roosevelt"~3^80.0)~0.01)
> </str>
> <str name="parsedquery_toString">
> +(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01
> (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3)
> (title2_unstem:"churchill roosevelt"~3^240.0 | text:"churchil
> roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 |
> author_unstem:"churchill roosevelt"~3^400.0 |
> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil
> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 |
> author2_unstem:"churchill roosevelt"~3^240.0 |
> title3_unstem:"churchill roosevelt"~3^80.0 | subject_t:"churchil
> roosevelt"~3^10.0 | other_number_unstem:"churchill roosevelt"~3^40.0 |
> subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil
> roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 |
> text_unstem:"churchill roosevelt"~3^80.0)~0.01
> </str>
> <lst name="explain"/>
> <str name="QParser">
> DisMaxQParser
> </str>
> <null name="altquerystring"/>
> <null name="boostfuncs"/>
> <lst name="timing">
> <double name="time">
> 6.0
> </double>
> <lst name="prepare">
> <double name="time">
> 3.0
> </double>
> <lst name="org.apache.solr.handler.component.QueryComponent">
> <double name="time">
> 2.0
> </double>
> </lst>
> <lst name="org.apache.solr.handler.component.FacetComponent">
> <double name="time">
> 0.0
> </double>
> </lst>
> <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
> <double name="time">
> 0.0
> </double>
> </lst>
> <lst name="org.apache.solr.handler.component.HighlightComponent">
> <double name="time">
> 0.0
> </double>
> </lst>
> <lst name="org.apache.solr.handler.component.StatsComponent">
> <double name="time">
> 0.0
> </double>
> </lst>
> <lst name="org.apache.solr.handler.component.SpellCheckComponent">
> <double name="time">
> 0.0
> </double>
> </lst>
> <lst name="org.apache.solr.handler.component.DebugComponent">
> <double name="time">
> 0.0
> </double>
> </lst>
> </lst>
>
>
>