Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2008/09/13 19:26:43 UTC

Extending query parser with MinShouldMatch syntax

Hi,

I would like to suggest an extension to Lucene's query syntax that will
allow application developers to send query constraints with a MinShouldMatch
value from the client application to the search engine. Examples of such
constraints are ACLs (security information) and other filters on the queries.
Today, client applications have no way to tell the back-end to treat some
filters as min-should-match (msm).

Suppose that I propose a file-type filter to the user, and the user typed
some keywords, like "hello world". The user gets back results, and he now
wants to filter those results by selecting "PDF" from the file-type filter. The
only query the client application can send to the back-end is "hello world
+filetype:pdf". But that doesn't work as expected. If queries are run with
OR operator as the default, then the documents that will be returned are
those that include filetype:pdf, and may or may not include "hello world".
This is not what the user expected though.

The only option today is for the application to parse the query, recognize
that a clause is an msm filter (how it would do that is not obvious, and
not easily extended to other filters), and set an msm on the resulting
query.

Instead, we could offer the following syntax:
- term# - defaults to msm '1'.
- term#<value> - sets msm to the specified value

What do you think?

Shai

Re: Extending query parser with MinShouldMatch syntax

Posted by Chris Hostetter <ho...@fucit.org>.
: Suppose that I propose a file-type filter to the user, and the user typed
: some keywords, like "hello world". The user gets back results, and he now
: wants to filter those results by selecting "PDF" from the file-type filter. The
: only query the client application can send to the back-end is "hello world
: +filetype:pdf". But that doesn't work as expected. If queries are run with
: OR operator as the default, then the documents that will be returned are
: those that include filetype:pdf, and may or may not include "hello world".
: This is not what the user expected though.

I'm really not understanding what that example has to do with
minShouldMatch ... the fundamental problem in your example is that if you
start with a query for...
	"hello world"
...and then want to restrict it to only docs that also match...
	filetype:pdf
...the combined query must have *both* clauses marked as mandatory...
	+"hello world" +filetype:pdf

minShouldMatch doesn't even factor in at all.
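
To make the clause semantics concrete, here is a minimal Java sketch of how
required (+) and optional clauses combine. It is a toy model, not Lucene code:
the class BooleanClauseModel, its Occur enum and the matches method are
hypothetical names made up for illustration, documents are modeled as plain
sets of terms, and the phrase "hello world" is approximated as two separate
terms.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of boolean clause matching (hypothetical, not the Lucene API).
final class BooleanClauseModel {
    enum Occur { MUST, SHOULD }

    static final class Clause {
        final Occur occur;
        final String term;
        Clause(Occur occur, String term) {
            this.occur = occur;
            this.term = term;
        }
    }

    // A document matches when every MUST clause hits and at least
    // minShouldMatch SHOULD clauses hit. With no MUST clause and a
    // minShouldMatch of 0, at least one SHOULD clause has to hit.
    static boolean matches(Set<String> docTerms, List<Clause> clauses, int minShouldMatch) {
        int shouldHits = 0;
        boolean hasMust = false;
        for (Clause c : clauses) {
            boolean hit = docTerms.contains(c.term);
            if (c.occur == Occur.MUST) {
                hasMust = true;
                if (!hit) {
                    return false;
                }
            } else if (hit) {
                shouldHits++;
            }
        }
        int required = (!hasMust && minShouldMatch == 0) ? 1 : minShouldMatch;
        return shouldHits >= required;
    }

    public static void main(String[] args) {
        // A PDF that contains neither "hello" nor "world".
        Set<String> pdfOnly = new HashSet<>(Arrays.asList("filetype:pdf", "invoice"));

        // hello world +filetype:pdf
        List<Clause> original = Arrays.asList(
            new Clause(Occur.SHOULD, "hello"),
            new Clause(Occur.SHOULD, "world"),
            new Clause(Occur.MUST, "filetype:pdf"));

        // +hello +world +filetype:pdf (stand-in for +"hello world" +filetype:pdf)
        List<Clause> allRequired = Arrays.asList(
            new Clause(Occur.MUST, "hello"),
            new Clause(Occur.MUST, "world"),
            new Clause(Occur.MUST, "filetype:pdf"));

        System.out.println(matches(pdfOnly, original, 0));    // true: the unwanted hit
        System.out.println(matches(pdfOnly, allRequired, 0)); // false
        System.out.println(matches(pdfOnly, original, 1));    // false
    }
}
```

In this model the unwanted document is excluded either by making every clause
mandatory or by requiring at least one optional clause to match.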

Independent of that, if you want to add minShouldMatch support to
QueryParser, there are two fairly straightforward ways to go, depending on
how generalized you want support to be...

1) minShouldMatch set on all BooleanQueries (as a function of length)

This is the approach the DisMaxQueryParser in Solr takes ... you override
the getBooleanQuery method in QueryParser, delegate to super, and then
modify the BooleanQuery returned, setting minShouldMatch based on some
function of the number of clauses it already contains.  The version in
Solr supports a grammar for deciding what it should be relative to various
cut-off points, as either an absolute number or a percentage...

http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html
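
As a rough illustration of "some function of the number of clauses", the
sketch below turns a spec that is either an absolute number or a percentage
into a minShouldMatch value. It is a deliberately simplified stand-in and
does not implement the Solr grammar linked above (no cut-off points, no
negative values); the class MinShouldMatchSketch and its compute method are
hypothetical names. In a real parser subclass the result would be applied to
the BooleanQuery returned by super.getBooleanQuery.

```java
// Simplified sketch, assuming a spec of the form "N" or "P%" with N, P >= 0.
final class MinShouldMatchSketch {

    // Returns how many optional clauses must match, given the spec and the
    // number of optional clauses in the query.
    static int compute(String spec, int optionalClauseCount) {
        String s = spec.trim();
        int result;
        if (s.endsWith("%")) {
            int percent = Integer.parseInt(s.substring(0, s.length() - 1).trim());
            // Integer division rounds the percentage down.
            result = (optionalClauseCount * percent) / 100;
        } else {
            result = Integer.parseInt(s);
        }
        // Never require fewer than zero or more clauses than exist.
        return Math.max(0, Math.min(result, optionalClauseCount));
    }

    public static void main(String[] args) {
        System.out.println(compute("2", 5));   // 2
        System.out.println(compute("75%", 4)); // 3
        System.out.println(compute("10", 3));  // 3 (clamped to the clause count)
    }
}
```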

2) overload the use of "~" in the parser grammar

Instead of adding a new special character to the grammar (I think you
suggested '#'), which could break back compatibility, you might want to
consider modifying the grammar to recognize the '~' character when it
follows a close paren as an indication of minShouldMatch on the boolean
query those parens wrap.  Since '~' is currently used for specifying
slop on phrase queries and fuzziness on fuzzy queries, it's already a
reserved character.
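
To show what the "(...)~N" form might look like to a caller, here is a small
Java sketch that recognizes the suffix with a regular expression. This is only
an illustration of the proposed syntax: an actual implementation would change
the QueryParser grammar itself, and the class TildeSuffixSketch and its
methods are hypothetical names. It only handles a single outer group and
makes no attempt to deal with nesting or escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processing sketch for a "(clauses)~N" suffix.
final class TildeSuffixSketch {
    // A parenthesized group followed by '~' and a small non-negative integer.
    private static final Pattern GROUP_WITH_MSM =
        Pattern.compile("^\\((.*)\\)~(\\d{1,9})$");

    // Returns N for a query of the form "(...)~N", or -1 if the suffix is absent.
    static int minShouldMatchOf(String query) {
        Matcher m = GROUP_WITH_MSM.matcher(query.trim());
        return m.matches() ? Integer.parseInt(m.group(2)) : -1;
    }

    // Returns the text inside the parens, or the trimmed query if there is no suffix.
    static String groupBodyOf(String query) {
        Matcher m = GROUP_WITH_MSM.matcher(query.trim());
        return m.matches() ? m.group(1) : query.trim();
    }

    public static void main(String[] args) {
        System.out.println(minShouldMatchOf("(hello world lucene)~2")); // 2
        System.out.println(minShouldMatchOf("\"hello world\"~3"));      // -1
        System.out.println(groupBodyOf("(hello world lucene)~2"));      // hello world lucene
    }
}
```

Because the pattern requires a close paren directly before the '~', phrase
slop ("hello world"~3) and fuzzy terms (roam~) are left untouched in this
sketch.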


-Hoss

