Posted to java-user@lucene.apache.org by Chris Harris <ry...@gmail.com> on 2012/06/22 02:08:09 UTC

Any CommonGrams-inspired tricks to speed up other proximity query types?

CommonGrams provides a neat trick for optimizing slow phrase queries that
contain common words. (E.g. Hathi Trust has some data
<http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2>
showing how effective this can be.) Unfortunately, it does nothing for
other position-based queries (e.g. SpanQueries or sloppy phrase queries);
if these latter types of query contain an extremely common word, then they
can continue to run quite slowly.
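
(For concreteness, here's roughly how CommonGramsFilter gets wired into
an analysis chain. This is a minimal sketch against recent Lucene APIs;
in the 3.x/4.x era these constructors also took a Version argument, and
the stopword list here is made up.)

import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.commongrams.CommonGramsFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

// Emits bigram tokens like "the_cat" for common words alongside the
// ordinary unigrams, so a phrase query containing a common word can be
// rewritten to hit a single, much rarer bigram term instead of walking
// the common word's huge postings list.
public class CommonGramsAnalyzer extends Analyzer {
  private final CharArraySet commonWords =
      new CharArraySet(Arrays.asList("a", "an", "and", "the", "of"), true);

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    TokenStream sink = new CommonGramsFilter(source, commonWords);
    return new TokenStreamComponents(source, sink);
  }
}

(The matching CommonGramsQueryFilter is then used on the query side so
that phrase queries keep only the bigrams.)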

The general lesson I take from CommonGrams is that by precomputing
structures that capture the proximity relationship between two or more
terms, you can speed things up. Bigrams are probably the simplest such
proximity-capturing structure, but it seems like others could exist. What
I'm curious about is whether there are more advanced structures that could
potentially speed up proximity search more generally, ideally ones that
don't require rewriting huge chunks of Lucene.

For example, the following scheme might not be a good idea, but perhaps it
hints at the sorts of things one could try:

***

Suppose I wanted to optimize cases where a common word appears within 10
words of some other word. Maybe I could get partway there by injecting what
you might call "proximitygrams" into the token stream. For example, rather
than simply tokenizing like so:

fifteen big lazy dogs, a fish, and a cat went shopping --> fifteen big lazy
dogs a fish and a cat went shopping

I might tokenize like so:

fifteen big lazy dogs, a fish, and a cat went shopping --> fifteen big lazy
dogs a fish {and, within10_and_fifteen, within10_and_big,
within10_and_lazy, within10_and_dogs, within10_and_a, within10_and_fish,
within10_and_cat, within10_and_went, within10_and_shopping} a cat went
shopping

where the curly braces are meant to show overlapping tokens (i.e. tokens
with positionIncrement==0). Each "within10_x_y" token records that terms x
and y appear within 10 words of one another.
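
To make the token-stream mechanics concrete, here is a rough, untested
sketch of such a filter. Everything here is hypothetical (names included),
and for brevity it only looks backward, i.e. it pairs a common word with
the 10 words before it, whereas the example above also pairs "and" with
the words that follow it:

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Set;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Hypothetical "proximitygram" filter: whenever a common word shows up,
// queue extra within10_<common>_<neighbor> tokens and emit them stacked
// on the same position (positionIncrement == 0). A real implementation
// would also need to deal with offsets, which clearAttributes() wipes.
public final class ProximityGramFilter extends TokenFilter {
  private static final int WINDOW = 10;

  private final Set<String> commonWords;
  private final ArrayDeque<String> history = new ArrayDeque<>(); // last 10 terms
  private final ArrayDeque<String> pending = new ArrayDeque<>(); // grams to emit

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);

  public ProximityGramFilter(TokenStream input, Set<String> commonWords) {
    super(input);
    this.commonWords = commonWords;
  }

  @Override
  public boolean incrementToken() throws IOException {
    // Drain queued proximitygrams first, overlapping the previous token.
    if (!pending.isEmpty()) {
      clearAttributes();
      termAtt.setEmpty().append(pending.poll());
      posIncAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String term = termAtt.toString();
    if (commonWords.contains(term)) {
      for (String neighbor : history) {
        pending.add("within10_" + term + "_" + neighbor);
      }
    }
    history.addLast(term);
    if (history.size() > WINDOW) {
      history.removeFirst();
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    history.clear();
    pending.clear();
  }
}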

Then, at query time, if the user wanted to find cases where "lazy" was
within 8 words of "and", I could get approximately what I wanted by
searching for the token "within10_and_lazy". I'd still have to do some
post-processing to throw out results where "and" and "lazy" were actually
9 or 10 words apart rather than within 8, but I might be part of the way
there.
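
In Lucene terms, that might look something like the sketch below (using
the 3.x-era query API; "body" is a made-up field, the gram clause only
pays off if the scorer can use it to skip ahead, and SpanNearQuery's slop
isn't quite the same thing as "within N words", so the exact value needs
care):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// AND the cheap precomputed gram term with the exact span check, so the
// rare gram narrows the candidate documents and the span query does the
// final "within 8 words" verification.
static Query lazyWithin8OfAnd() {
  TermQuery gram = new TermQuery(new Term("body", "within10_and_lazy"));
  SpanQuery exact = new SpanNearQuery(
      new SpanQuery[] {
          new SpanTermQuery(new Term("body", "and")),
          new SpanTermQuery(new Term("body", "lazy"))
      },
      8,       // slop
      false);  // unordered
  BooleanQuery query = new BooleanQuery(); // newer versions use a Builder
  query.add(gram, BooleanClause.Occur.MUST);
  query.add(exact, BooleanClause.Occur.MUST);
  return query;
}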

Another possible extension for this scheme would be to use payloads so that
each of the "within10" tokens actually knew the exact positions of the two
terms it was referring to. (I guess it could also record exactly how far
apart the two terms were.) It seems like you could potentially use this to
optimize SpanNear queries between common terms, or sloppy phrase queries.
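
The indexing half of that might look like the following (hypothetical,
building on the filter sketch above; setPayload(BytesRef) is the
4.x-style API):

import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Hypothetical addition to the ProximityGramFilter sketch: stamp each
// proximitygram with the gap between its two terms, so a scorer could
// check the exact distance without re-reading the raw terms' positions.
private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

private void emitGram(String gramTerm, int distance) {
  clearAttributes();
  termAtt.setEmpty().append(gramTerm);
  posIncAtt.setPositionIncrement(0);  // still overlapping, as before
  payloadAtt.setPayload(
      new BytesRef(new byte[] { (byte) Math.min(distance, Byte.MAX_VALUE) }));
}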

I don't know if anything like the above would actually pay for itself in
terms of increased index size and code complexity. But I wanted to check if
this or anything remotely along these lines sounded promising to anyone.