You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Timo Nentwig <lu...@nitwit.de> on 2007/12/11 13:27:59 UTC

Caching FuzzyQuery

Hi!

Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a 
caching decorator? A WeakHashMap with key==IndexReader and value==LRU of 
BooleanQueries.

Timo

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Caching FuzzyQuery

Posted by Timo Nentwig <lu...@nitwit.de>.

On Tue, 11 Dec 2007, Timo Nentwig wrote:

> Date: Tue, 11 Dec 2007 13:27:59 +0100
> From: Timo Nentwig <lu...@nitwit.de>
> Reply-To: java-dev@lucene.apache.org
> To: java-dev@lucene.apache.org
> Subject: Caching FuzzyQuery
> 
> Hi!
>
> Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a
> caching decorator? A WeakHashMap with key==IndexReader and value==LRU of
> BooleanQueries.

So, didn't become a popular approach/topic. Anyway, I wrote an aspect and 
it does the job for a couple of months now, so I'd like to share it.

HTH
Timo


In order to enable runtime AspectJ weaving start your JVM with

    -javaagent:./foo/bar/lib/aspectjweaver.jar

and place the following file called "aop.xml" in META-INF.

<aspectj>
         <aspects>
                 <aspect name="foo.bar.FuzzyQueryCachingAspect" />
         </aspects>

         <weaver options="-verbose -showWeaveInfo">
                 <include within="org.apache.lucene.search.*" />
                 <include within="org.apache.lucene.index.*" />
         </weaver>
</aspectj>


@Aspect
public class FuzzyQueryCachingAspect
{
         private static final Log LOG = LogFactory.getLog( 
FuzzyQueryCachingAspect.class );
         // your cache impl goes here
         private static final Caching CACHE = Caching.instance( 
Caching.FUZZY );

         @Before( "execution(public void 
org.apache.lucene.index.IndexReader+.close())" )
         public void flush( final JoinPoint jp ) throws Throwable
         {
                 assert jp.getThis() == jp.getTarget();

                 final IndexReader r = (IndexReader)jp.getThis();
                 final Directory d = r.directory();

                 final String group = name( d );

                 CACHE.remove( group );
                 LOG.info( "Removed from " + CACHE + ": " + group );
         }

         @Around( "execution(public org.apache.lucene.search.Query
org.apache.lucene.search.FuzzyQuery.rewrite(org.apache.lucene.index.IndexReader)) 
&& args(reader)" )
         public Object cache( final IndexReader reader, final 
ProceedingJoinPoint jp ) throws Throwable
         {
                 assert jp.getThis() == jp.getTarget();

                 final FuzzyQuery q = (FuzzyQuery)jp.getThis();
                 final String term = q.getTerm().text();

                 final String group = name( reader.directory() );
                 final String key = term + "~" + q.getMinSimilarity();

                 final Object e = CACHE.get( group, key );
                 if( e != null )
                 {
                         LOG.debug( term + " found in cache" );
                         return e;
                 }

                 final long t0 = System.currentTimeMillis();
                 final Query fuzzy = (Query)jp.proceed();
                 CACHE.put( group, key, fuzzy );
                 LOG.info( term + " rewritten in " + 
(System.currentTimeMillis() - t0) + "ms (" + group + ")" );

                 return fuzzy;
         }

         private static final String name( final Directory d )
         {
                 if( d instanceof FSDirectory ) return 
((FSDirectory)d).getFile().getAbsolutePath();

                 return String.valueOf( d.hashCode() );
         }
}


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Caching FuzzyQuery

Posted by Chris Hostetter <ho...@fucit.org>.
: > Hoss means calling rewrite on the *result* of a rewrite.
: 
: Uh? That's what I mean (propose), too... But currently nothing's cached at 
: all.
: 
: Cache the result (BooleanQuery) of rewrite() in a WeakHashMap with key = 
: IndexReader and value = LRU.

yes .. you can do that.  you don't need anything in the core lucene 
Searchers to do it for you (ie: nothing needs refactored out, because 
redundent rewrite calls are cheap) ... in the paragraph below, 
"application" refers to your code, "Lucene" refers to the code in the 
lucene JAR...

: > So the application would call rewrite, cache the resulting query, and
: > then use that already rewritten query from then on.  Lucene wouldn't
: > know it had already been rewritten, and would call rewrite again, but
: > it should be fast.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Caching FuzzyQuery

Posted by Timo Nentwig <lu...@nitwit.de>.
On Saturday 15 December 2007 20:48:38 Yonik Seeley wrote:
> On Dec 15, 2007 2:23 PM, Timo Nentwig <lu...@nitwit.de> wrote:
> > On Saturday 15 December 2007 00:17:10 Chris Hostetter wrote:
> > > : Actually FuzzyQuery.rewrite() is pretty expensive so why not
> > > : introduce a caching decorator? A WeakHashMap with key==IndexReader
> > > : and value==LRU of BooleanQueries.
> > >
> > > Applications are certainly welcome to do this (there is nothing to stop
> > > you from calling rewrite before passing the query to your Searcher, i
> > > believe the overhead of calling rewrite on a query that's already been
> > > rewritten is fairly low) but I don't think it would be a good idea to
> > > add
> >
> > Why should subsequent rewrites be faster?
>
> Hoss means calling rewrite on the *result* of a rewrite.

Uh? That's what I mean (propose), too... But currently nothing's cached at 
all.

Cache the result (BooleanQuery) of rewrite() in a WeakHashMap with key = 
IndexReader and value = LRU.

> So the application would call rewrite, cache the resulting query, and
> then use that already rewritten query from then on.  Lucene wouldn't
> know it had already been rewritten, and would call rewrite again, but
> it should be fast.
>
> -Yonik
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Caching FuzzyQuery

Posted by Chris Hostetter <ho...@fucit.org>.
: I'm against caching in general because you always run into some hard to 
: understand and examine problem but this seems to be one of the rare cases 
: where caching makes sense.

For you maybe, but i've actually seen very few instances in practice where 
people who are speed concious use FuzzyQueries (or other queries with 
really expensive rewrite methods) to a large degree .... in the few casees 
people do:
  1) they usually go whole hog and cache search results 
     (ie: query+sort => list of docs)
  2) they want direct control of the cache.

: Well, at least the existing of such an decorator (which you explicitly have to 
: use) will give you a hint that this is performance hot spot. I took me quite 
: some time to figure it out...

i would argue that for a lot of people something like "Solr" is that 
decorator ... but if you'd like to submit a contrib that does something 
like this with a nice generic API by all means please do.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Caching FuzzyQuery

Posted by Yonik Seeley <yo...@apache.org>.
On Dec 15, 2007 2:23 PM, Timo Nentwig <lu...@nitwit.de> wrote:
> On Saturday 15 December 2007 00:17:10 Chris Hostetter wrote:
> > : Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a
> > : caching decorator? A WeakHashMap with key==IndexReader and value==LRU of
> > : BooleanQueries.
> >
> > Applications are certainly welcome to do this (there is nothing to stop
> > you from calling rewrite before passing the query to your Searcher, i
> > believe the overhead of calling rewrite on a query that's already been
> > rewritten is fairly low) but I don't think it would be a good idea to add
>
> Why should subsequent rewrites be faster?

Hoss means calling rewrite on the *result* of a rewrite.
So the application would call rewrite, cache the resulting query, and
then use that already rewritten query from then on.  Lucene wouldn't
know it had already been rewritten, and would call rewrite again, but
it should be fast.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Caching FuzzyQuery

Posted by Timo Nentwig <lu...@nitwit.de>.
On Saturday 15 December 2007 00:17:10 Chris Hostetter wrote:
> : Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a
> : caching decorator? A WeakHashMap with key==IndexReader and value==LRU of
> : BooleanQueries.
>
> Applications are certainly welcome to do this (there is nothing to stop
> you from calling rewrite before passing the query to your Searcher, i
> believe the overhead of calling rewrite on a query that's already been
> rewritten is fairly low) but I don't think it would be a good idea to add

Why should subsequent rewrites be faster? The query is being rewritten every 
time over and over again. Even *if* you can profit by buffered IO you sill 
have a plenty of string levenshtein OPs.

I'm against caching in general because you always run into some hard to 
understand and examine problem but this seems to be one of the rare cases 
where caching makes sense.

I attached a small test app, the index contains 2.2 million docs and 5 million 
terms, I search for a pretty common term which was rewritten to 15 terms and 
hit roughly 4.000 docs (I also tried a term that was rewritten to 1 term and 
hit about 300 docs, made no difference):

rewritten in 809
Overall search time: 842
rewritten in 271
Overall search time: 274
rewritten in 216
Overall search time: 219
rewritten in 180
Overall search time: 182
rewritten in 184
Overall search time: 186
rewritten in 220
Overall search time: 226
rewritten in 207
Overall search time: 208
rewritten in 181
Overall search time: 183
rewritten in 183
Overall search time: 185
rewritten in 180
Overall search time: 181

$ vmstat -S M
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0    757    298     56    384    0    0    21    36   39    9  5  1 94  0

> something like this to the core ...for starters we are trying to move
> away from "hidden" caches like this that are not transparent (and

Well, at least the existing of such an decorator (which you explicitly have to 
use) will give you a hint that this is performance hot spot. I took me quite 
some time to figure it out...

> controllable) but the users because they have the potential to eat up a
> lot of ram.  But also: he amount of time needed to rewrite the query is
> probably not vastly more expensive then the anout of time to execute the
> search .. you might as well cache the entire result keyed off of the
> orriginal query (and not just the rewritten query object).
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org



Re: Caching FuzzyQuery

Posted by Chris Hostetter <ho...@fucit.org>.
: Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a 
: caching decorator? A WeakHashMap with key==IndexReader and value==LRU of 
: BooleanQueries.

Applications are certainly welcome to do this (there is nothing to stop 
you from calling rewrite before passing the query to your Searcher, i 
believe the overhead of calling rewrite on a query that's already been 
rewritten is fairly low) but I don't think it would be a good idea to add 
something like this to the core ...for starters we are trying to move 
away from "hidden" caches like this that are not transparent (and 
controllable) but the users because they have the potential to eat up a 
lot of ram.  But also: he amount of time needed to rewrite the query is 
probably not vastly more expensive then the anout of time to execute the 
search .. you might as well cache the entire result keyed off of the 
orriginal query (and not just the rewritten query object).


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org