Posted to java-user@lucene.apache.org by Marcos Juarez Lopez <mj...@gmail.com> on 2013/10/16 19:10:30 UTC

Extending Query class for custom query evaluation.

Posted something similar some time ago, but didn't get any responses, so I
thought I'd try again with more details.

We allow end-user queries that have our own proprietary query language,
which we then translate to a Lucene Query* AST.  This has worked well for
us. However, a few of the operators we allow have extremely high document
frequency, matching more than 60% of documents. End users sometimes want
to get a count of all documents matching that field value. Since we're
trying to get as close as possible to the 2.1B document limit per index,
this type of query can take more than 20 seconds. Most of these operators
are boolean values, which we could cache externally ahead of time in an
in-memory bitset, using docID as the index into the bitset. Based on preliminary
testing, we know that using bitsets can significantly speed up these count
queries.
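
To make the idea concrete, here is a minimal sketch of the kind of cache we
have in mind, written against plain java.util.BitSet rather than any Lucene
class. The class name FlagBitsetCache and its methods are hypothetical, and
it assumes docIDs are stable for the lifetime of the cache:

```java
import java.util.BitSet;

/** Hypothetical in-memory cache for one boolean field, indexed by docID. */
class FlagBitsetCache {
    private final BitSet bits;

    /** maxDoc is only a sizing hint; BitSet grows on demand. */
    FlagBitsetCache(int maxDoc) {
        this.bits = new BitSet(maxDoc);
    }

    /** Mark a document as having the boolean value set. */
    void set(int docId) {
        bits.set(docId);
    }

    /** True if the given docID has the boolean value set. */
    boolean matches(int docId) {
        return bits.get(docId);
    }

    /** Count of matching documents, without touching the index. */
    int count() {
        return bits.cardinality();
    }
}
```

A count query over such a field then reduces to a single count() call on
the cached bitset instead of a walk over the postings.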

The question then is how to tie the bitset implementation into query
evaluation. We considered Filters, but it seems like those are only
particularly useful when you want to filter a whole result set. In our case
these clauses can appear at any level of the query tree. The next thought
is creating a custom implementation of the Query class (similar to
TermQuery, etc), that knows how to evaluate based on the bitset rather than
going to the index itself. This looks possible but fairly involved.
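
As a rough illustration of what we mean by bitset-backed clauses at any
level of the tree, the sketch below evaluates a small boolean tree entirely
on java.util.BitSet. It deliberately does not use Lucene's Query API; the
names BitNode, CachedLeaf, AndNode, OrNode and NotNode are all hypothetical,
and a real implementation would have to plug into Lucene's own query
evaluation instead:

```java
import java.util.BitSet;

/** Hypothetical query-tree node; this is not Lucene's Query class. */
interface BitNode {
    /** Returns a fresh BitSet of matching docIDs in [0, maxDoc). */
    BitSet eval(int maxDoc);
}

/** Leaf answered from a precomputed bitset instead of the index. */
final class CachedLeaf implements BitNode {
    private final BitSet cached;
    CachedLeaf(BitSet cached) { this.cached = cached; }
    public BitSet eval(int maxDoc) {
        return (BitSet) cached.clone(); // copy so callers may mutate
    }
}

final class AndNode implements BitNode {
    private final BitNode left, right;
    AndNode(BitNode left, BitNode right) { this.left = left; this.right = right; }
    public BitSet eval(int maxDoc) {
        BitSet result = left.eval(maxDoc);
        result.and(right.eval(maxDoc));
        return result;
    }
}

final class OrNode implements BitNode {
    private final BitNode left, right;
    OrNode(BitNode left, BitNode right) { this.left = left; this.right = right; }
    public BitSet eval(int maxDoc) {
        BitSet result = left.eval(maxDoc);
        result.or(right.eval(maxDoc));
        return result;
    }
}

final class NotNode implements BitNode {
    private final BitNode child;
    NotNode(BitNode child) { this.child = child; }
    public BitSet eval(int maxDoc) {
        BitSet result = child.eval(maxDoc);
        result.flip(0, maxDoc); // complement within the docID range
        return result;
    }
}
```

With this shape, a count for something like "A AND NOT B" is
new AndNode(a, new NotNode(b)).eval(maxDoc).cardinality(), and the cached
leaves can sit anywhere in the tree.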

It seems like this can't be a new problem, so we're wondering if there's
pre-existing work here that we're missing to make this easier. Any
thoughts?

Thanks,

Marcos Juarez