Posted to solr-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2009/01/09 03:16:47 UTC

Re: Dismax Minimum Match/Stopwords Bug

: Hmm, that makes sense to me - however I still think that even if we have mm
: set to "2" and we have "the 7449078" it should still match 7449078 in a
: productId field (it does not:
: http://zeta.zappos.com/search?department=&term=the+7449078). This seems like
: it works against the way one would reasonably expect it to - that stopwords
: shouldn't impact the counts for mm (so, "the 7449078" would count as 1 term
: for mm since "the" is a stopword).

this is back to the original "problem"...

"stopwords" is an analyzer concept; "minShouldMatch" is a 
BooleanQuery/DisMaxQueryParser concept ... if all of the analyzers for all 
of your fields agree on the list of stopwords, then q=the+7449078 will 
result in "the" getting thrown out and you'll only have one clause.  but 
if one of the fields has an analyzer that says "the" is a valid term, then 
it's a valid term and it gets a clause in the query.  if it gets a clause 
in the query, then it factors into the minShouldMatch calculation.
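
a rough sketch of that clause-building logic, in python rather than actual 
Solr code.  the field names and stopword sets are made up for illustration, 
and build_clauses is a hypothetical helper, not anything in Solr:

```python
# Toy model of how dismax decides which query words become clauses.
# Not Solr code: the field names and stopword sets below are hypothetical.

FIELD_STOPWORDS = {
    "desc": {"the", "a", "of"},   # analyzer with a stopword filter
    "productId": set(),           # analyzer that keeps every token
}

def build_clauses(query, field_stopwords):
    """Return one clause per query word that survives analysis in at
    least one field. Each clause pairs the word with the fields that kept it."""
    clauses = []
    for word in query.lower().split():
        fields = [f for f, stops in field_stopwords.items() if word not in stops]
        if fields:  # dropped by every analyzer -> no clause at all
            clauses.append((word, fields))
    return clauses

# "the" survives in productId, so it gets its own clause
mixed = build_clauses("the 7449078", FIELD_STOPWORDS)

# every field agrees "the" is a stopword, so only one clause is left
consistent = build_clauses(
    "the 7449078", {"desc": {"the"}, "productId": {"the"}}
)

print(len(mixed), len(consistent))
```

with the mixed configuration there are two clauses for minShouldMatch to 
count; once every field drops "the" there is only one.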

in that particular situation i believe the solution you want is to use the 
same stopwords on your productId field that you have on your other fields, 
so "the" doesn't get a query clause at all ... unless you want 
q=the+7449078 to return product#7449078 if and only if it also has "the" 
in its productId field.

: We have people asking for "the north" to return results from a brand called
: "the north face" - but it doesn't, and can't, because of this mm issue.

it may not work for you right now, but that doesn't mean it can't :)  ... 
i'm not sure why it wouldn't actually.

consider a query like this...
 
   q=the north&qf=manu^2 prodName^1 desc^0.5&pf=...&mm=100%

let's say that "desc" uses stop words, but prodName and manu don't 
(because we know we have manufacturer and product names like "the north 
face"). we're going to get one DisjunctionMaxQuery for "the" (on the manu 
and prodName fields) and one DisjunctionMaxQuery for "north" (on manu, 
prodName, and desc) and that's 2 clauses on a BooleanQuery whose 
minShouldMatch is going to be 2 (because 100% of 2 is 2; percentages get 
rounded down, so something like 66% of 2 would only give you 1)  so 
now all products with "the" and "north" in their manufacturer name *OR* 
product name will match -- even if it's "the" in manu and "north" 
in prodName.  products will even match if the only place they contain 
"north" is in the description -- but only if they also contain "the" in 
manu or prodName.  if you think "that's silly, why is 'the' required? i 
want it to be a stopword!" then the solution is to make it a stopword 
*everywhere* (including manu and prodName) ... as long as it's not a 
stopword there, it's considered significant, so it needs to match.
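
the matching behaviour described above can be sketched like this.  it's a 
toy model in python, not Solr code: the documents are invented, matches() 
is a hypothetical helper, and minShouldMatch is simply taken to be 2 as in 
the example:

```python
# Toy model of the "the north" example; not Solr code.
# Clauses follow the example: "desc" drops "the", manu and prodName keep it.
CLAUSES = {
    "the": ["manu", "prodName"],
    "north": ["manu", "prodName", "desc"],
}
MIN_SHOULD_MATCH = 2  # both clauses required, as in the example

def matches(doc, clauses=CLAUSES, min_should_match=MIN_SHOULD_MATCH):
    """A clause is satisfied if its word occurs in any of its fields
    (the DisjunctionMaxQuery part); the doc matches when enough clauses
    are satisfied (the BooleanQuery minShouldMatch part)."""
    satisfied = 0
    for word, fields in clauses.items():
        if any(word in doc.get(f, "").lower().split() for f in fields):
            satisfied += 1
    return satisfied >= min_should_match

# Hypothetical documents
brand = {"manu": "the north face", "prodName": "summit jacket", "desc": "warm"}
split = {"manu": "the gap", "prodName": "north shore tee", "desc": "cotton"}
desc_only = {"manu": "acme", "prodName": "compass", "desc": "points the way north"}
mixed = {"manu": "the gap", "prodName": "compass", "desc": "points north"}

print(matches(brand), matches(split), matches(desc_only), matches(mixed))
```

"desc_only" is the case people find surprising: "north" is found in the 
description, but nothing satisfies the "the" clause, so only 1 of the 2 
required clauses matches.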


-Hoss