Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2008/09/13 19:26:43 UTC
Extending query parser with MinShouldMatch syntax
Hi,
I would like to suggest an extension to Lucene's query syntax, which will
allow application developers to send query constraints with a MinShouldMatch
value to the search engine, from the client application. Such constraints
are for example ACL (security information) and other filters on the queries.
Client applications simply have no way to tell the back-end to consider some
filters as min-should-match (or msm).
Suppose that I offer a file-type filter to the user, and the user typed
some keywords, like "hello world". The user gets back results, and now
wants to filter those results by selecting "PDF" from the file-type filter.
The only query the client application can send to the back-end is "hello
world +filetype:pdf". But that doesn't work as expected. If queries are run
with OR as the default operator, then the documents returned are those that
match filetype:pdf, and they may or may not include "hello world".
This is not what the user expected.
The only option today for the application is to parse the query, understand
that this is an msm filter (though how it would do that is not obvious, and
not easily extensible to other filters) and set an msm on the resulting
query.
Instead, we could offer the following syntax:
- term# - defaults to msm '1'.
- term#<value> - set msm according to the specified value
What do you think?
Shai
Re: Extending query parser with MinShouldMatch syntax
Posted by Chris Hostetter <ho...@fucit.org>.
: Suppose that I propose a file-type filter to the user, and the user typed
: some keywords, like "hello world". The user gets back results, and he now
: wants to filter those results by select "PDF" from the file-type filter. The
: only query the client application can send to the back-end is "hello world
: +filetype:pdf". But that doesn't work as expected. If queries are run with
: OR operator as the default, then the documents that will be returned are
: those that include filetype:pdf, and may or may not include "hello world".
: This is not what the user expected though.
I'm really not understanding what that example has to do with
minShouldMatch ... the fundamental problem in your example is that if you
start with a query for...
"hello world"
...and then want to restrict it to only docs that also match...
filetype:pdf
...the combined query must have *both* clauses marked as mandatory...
+"hello world" +filetype:pdf
minShouldMatch doesn't even factor in at all.
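To make the difference concrete, here is a minimal sketch in plain Java. It
is a toy model of boolean clause matching, not Lucene code: the class
BooleanMatchSketch and its helpers are hypothetical names, documents are
reduced to sets of terms, and the phrase "hello world" is simplified to two
separate terms. It assumes the usual rule that every mandatory clause must
match and that, when no clause is mandatory, at least one optional clause
must match.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of boolean clause matching; hypothetical, NOT Lucene code.
final class BooleanMatchSketch {
    enum Occur { MUST, SHOULD }

    static final class Clause {
        final String term;
        final Occur occur;

        Clause(String term, Occur occur) {
            this.term = term;
            this.occur = occur;
        }
    }

    static Clause must(String term) { return new Clause(term, Occur.MUST); }

    static Clause should(String term) { return new Clause(term, Occur.SHOULD); }

    // True if the document (a set of terms) satisfies the clauses.
    static boolean matches(List<Clause> clauses, int minShouldMatch, Set<String> docTerms) {
        boolean hasMust = false;
        int shouldHits = 0;
        for (Clause c : clauses) {
            boolean hit = docTerms.contains(c.term);
            if (c.occur == Occur.MUST) {
                hasMust = true;
                if (!hit) {
                    return false; // a mandatory clause failed
                }
            } else if (hit) {
                shouldHits++;
            }
        }
        // With no mandatory clause, at least one optional clause has to match.
        int required = (!hasMust && minShouldMatch == 0) ? 1 : minShouldMatch;
        return shouldHits >= required;
    }

    public static void main(String[] args) {
        // A PDF that contains neither "hello" nor "world".
        Set<String> pdfOnly = new HashSet<>(Arrays.asList("filetype:pdf"));

        // hello world +filetype:pdf  (keywords optional)
        List<Clause> loose = Arrays.asList(should("hello"), should("world"), must("filetype:pdf"));
        // +hello +world +filetype:pdf  (everything mandatory)
        List<Clause> strict = Arrays.asList(must("hello"), must("world"), must("filetype:pdf"));

        System.out.println(matches(loose, 0, pdfOnly));  // optional keywords are ignored
        System.out.println(matches(strict, 0, pdfOnly)); // keywords are required
        System.out.println(matches(loose, 2, pdfOnly));  // msm=2 also requires both keywords
    }
}
```

In this model, marking every clause as mandatory excludes the PDF that lacks
the keywords. Setting minShouldMatch to the number of optional clauses has
the same effect here, which may be what the original proposal was reaching
for.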
Independent of that, if you want to add minShouldMatch support to
QueryParser, there are two fairly straightforward ways to go, depending on
how generalized you want support to be...
1) minShouldMatch set on all BooleanQueries (as a function of length)
This is the approach the DisMaxQueryParser in Solr takes ... you override
the getBooleanQuery method in QueryParser, delegate to super, and then
modify the BooleanQuery returned, setting minShouldMatch based on some
function of the number of clauses it already contains. The version in
Solr supports a grammar for deciding what it should be relative to various
cut-off points, as either an absolute number or a percentage...
http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html
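As an illustration of "some function of the number of clauses", here is a
minimal sketch in plain Java. The rule itself is an assumption made up for
this example and is not Solr's grammar: up to a cut-off, every optional
clause is required; beyond it, a percentage of them (rounded down) is
required, never fewer than the cut-off. The class and method names are
hypothetical. In a QueryParser subclass the result would presumably be
passed to BooleanQuery.setMinimumNumberShouldMatch(int) inside the
getBooleanQuery override; that wiring is omitted because the override's
signature differs between Lucene versions.

```java
// Hypothetical rule for deriving minShouldMatch from the clause count.
// This is an illustrative assumption, NOT the grammar Solr implements.
final class MinShouldMatchRuleSketch {

    // optionalClauses: number of optional (SHOULD) clauses in the query
    // cutoff:          up to this many clauses, all of them are required
    // percent:         share of clauses required beyond the cut-off (0..100)
    static int minShouldMatch(int optionalClauses, int cutoff, int percent) {
        if (optionalClauses < 0 || cutoff < 0 || percent < 0 || percent > 100) {
            throw new IllegalArgumentException("arguments out of range");
        }
        if (optionalClauses <= cutoff) {
            return optionalClauses; // short queries: every clause must match
        }
        // Longer queries: a percentage, rounded down, but never below the cut-off.
        return Math.max(cutoff, optionalClauses * percent / 100);
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 10; n++) {
            System.out.println(n + " clauses -> msm " + minShouldMatch(n, 2, 75));
        }
    }
}
```

Keeping the rule in a small pure function like this makes it easy to test
separately from the parser.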
2) overload the use of "~" in the parser grammar
Instead of adding a new special character to the grammar (I think you
suggested '#'), which could break backward compatibility, you might want to
consider modifying the grammar to recognize the '~' character when it
follows a close paren as an indication of minShouldMatch on the boolean
query those parens wrap. Since '~' is currently used for specifying
slop on phrase queries and fuzziness on fuzzy queries, it's already a
reserved character.
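To show what a '~' after a close paren could look like to a caller, here is
a minimal string-level sketch in plain Java. It is not a change to the
QueryParser grammar: the class TildeSuffixSketch and the exact syntax
"(clauses)~N" are assumptions for illustration, and the sketch only splits
one top-level parenthesized group from its optional numeric suffix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// String-level illustration of a "(clauses)~N" suffix; hypothetical names,
// NOT the actual QueryParser grammar.
final class TildeSuffixSketch {
    // A parenthesized group, optionally followed by ~<digits>.
    private static final Pattern GROUP = Pattern.compile("^\\((.*)\\)(?:~(\\d+))?$");

    static final class Group {
        final String body;        // text between the outer parens
        final int minShouldMatch; // 0 when no ~N suffix is present

        Group(String body, int minShouldMatch) {
            this.body = body;
            this.minShouldMatch = minShouldMatch;
        }
    }

    static Group parse(String input) {
        Matcher m = GROUP.matcher(input.trim());
        if (!m.matches()) {
            // Anything not wrapped in parens (e.g. a phrase with slop) is left alone.
            throw new IllegalArgumentException("not a parenthesized group: " + input);
        }
        int msm = (m.group(2) == null) ? 0 : Integer.parseInt(m.group(2));
        return new Group(m.group(1).trim(), msm);
    }

    public static void main(String[] args) {
        Group g = parse("(hello world lucene)~2");
        System.out.println(g.body + " / msm=" + g.minShouldMatch);
    }
}
```

Because the suffix is only recognized after a close paren, a phrase query
such as "hello world"~2 is not treated as a group here, so the existing
slop meaning of '~' is untouched.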
-Hoss