Posted to dev@lucene.apache.org by karl wettin <ka...@gmail.com> on 2007/01/14 18:18:46 UTC

Decorative cache (and Hits.setSearcher)

I just wrote a transparent search results cache. Since it depends on
just about everything in my big JIRA issue, I posted it there.

It is based on this thread:
<http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200608.mbox/%3c1154673229.5704.149.camel@localhost%3e>

When the index is modified, the cache updates only the cached results
that are affected. When documents are added, the affected results are
found by running each cached query against an index that contains only
the new documents. Deletes are simpler. The whole cache is cleared when
the index is optimized.
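
To make the add-time check concrete, here is a minimal sketch in plain Java. It is a toy model and not the code from the patch: a document is just a set of terms, a query is a single term, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model (hypothetical names): a document is a set of terms, a query is one term. */
final class SelectiveInvalidationCache {
    private final List<Set<String>> index = new ArrayList<>();
    private final Map<String, List<Integer>> cache = new HashMap<>();

    /** Returns the matching document numbers, served from the cache when possible. */
    List<Integer> search(String term) {
        List<Integer> cached = cache.get(term);
        if (cached != null) {
            return cached;
        }
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < index.size(); doc++) {
            if (index.get(doc).contains(term)) {
                hits.add(doc);
            }
        }
        cache.put(term, hits);
        return hits;
    }

    /** Adds documents and evicts only the cached queries that match one of them. */
    void addDocuments(List<Set<String>> newDocs) {
        // Run every cached query against the batch of new documents only.
        cache.keySet().removeIf(term ->
            newDocs.stream().anyMatch(doc -> doc.contains(term)));
        index.addAll(newDocs);
    }

    /** Convenience factory for a toy document. */
    static Set<String> doc(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }
}
```

In this model a cached result for "jack" survives the addition of a document that only contains "johnny", while a cached result for "and" is dropped as soon as a new document contains "and".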

The Hits, HitCollect(ion), TopDocs and TopFieldDocs caches can each be
turned on and off as wanted. You probably don't want to, but you could,
for instance, cache both the Hits and the HitCollect(ion) that the Hits
are built from.

Filters, sort orders and the like are taken into account.

I'm not certain I understand the possible effects of replacing the
searcher in an instance of Hits, but this is what I do in order to
keep cached instances valid when the index is updated with changes
that do not affect the results in the Hits. I would very much like to
hear from someone about that.

The hit collection is cached in a ScoreDoc[]. It is ordered by document
number so that lookups at delete time are fast. It is possible to also
cache the chronological order of the original hit collection, but that
costs an extra 64 bits of heap per hit (in my environment).
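
As an illustration of why ordering by document number helps at delete time, here is a small sketch. It does not use the Lucene ScoreDoc class; CachedHit and DocOrderedHits are hypothetical stand-ins, and the real patch may well handle deletes differently.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical stand-in for a (document number, score) pair. */
final class CachedHit {
    final int doc;
    final float score;

    CachedHit(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

/** Cached hit collection kept sorted by document number. */
final class DocOrderedHits {
    private static final Comparator<CachedHit> BY_DOC =
        Comparator.comparingInt(h -> h.doc);

    private CachedHit[] hits;

    DocOrderedHits(CachedHit[] collected) {
        hits = collected.clone();
        Arrays.sort(hits, BY_DOC);
    }

    /** Removes the hit for a deleted document; binary search finds it in O(log n). */
    boolean remove(int doc) {
        int pos = Arrays.binarySearch(hits, new CachedHit(doc, 0f), BY_DOC);
        if (pos < 0) {
            return false; // this cached result does not contain the document
        }
        CachedHit[] shrunk = new CachedHit[hits.length - 1];
        System.arraycopy(hits, 0, shrunk, 0, pos);
        System.arraycopy(hits, pos + 1, shrunk, pos, hits.length - pos - 1);
        hits = shrunk;
        return true;
    }

    int size() {
        return hits.length;
    }

    int docAt(int i) {
        return hits[i].doc;
    }
}
```

With many cached results, most of them will not contain a given deleted document, so the cheap negative answer from the binary search is what matters most.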


For now there is only test code for the Hits-cache, but everything  
else seems to work fine(tm).

Except for being SoftReferenced, there is no cache priority handling.
Because of how index modification notification is implemented, the
cache is not compatible with remote searchables.
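
For readers unfamiliar with soft references, this is roughly what "SoftReferenced, with no priority handling" could look like. It is a sketch under my own assumptions; SoftCache is a hypothetical name and not a class from the patch.

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical cache whose values the garbage collector may reclaim under memory pressure. */
final class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new ConcurrentHashMap<>();

    void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    /** Returns the cached value, or null if it was never cached or has been collected. */
    V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) {
            return null;
        }
        V value = ref.get();
        if (value == null) {
            map.remove(key, ref); // the referent was collected, drop the stale entry
        }
        return value;
    }
}
```

The JVM decides which entries to clear, so there is no control over which cached results survive; that is the missing priority handling.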


This is what an implementation looks like:

import java.util.Collections;

import junit.framework.TestCase;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.TermQuery;

// NotifiableIndex, RAMDirectoryIndex, SearcherCache and InterfaceIndexWriter
// come from the patch; their imports are omitted here.

public class TestCachedSearcher extends TestCase {

   private static final String f = "field";

   /** Runs a single-term query through the cached searcher. */
   private static Hits search(SearcherCache cache, String text) throws Exception {
     return cache.getSearcher().search(new TermQuery(new Term(f, text)));
   }

   public void testCachedHits() throws Exception {

     NotifiableIndex index = new NotifiableIndex(new RAMDirectoryIndex());
     SearcherCache cache = new SearcherCache(index,
         SearcherCache.HitCollectionCacheState.off, true, false, false);
     Analyzer analyzer = new StandardAnalyzer(Collections.emptySet());
     InterfaceIndexWriter writer;

     // One document that matches "and".
     writer = index.openIndexWriter(analyzer, true);
     Document document = index.getDocumentFactory().newInstance();
     document.add(new Field(f,
         "Do you really want to go and live in that hotel for the winter?",
         Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.NO));
     writer.addDocument(document);
     writer.close();

     Hits andHits = search(cache, "and");
     assertEquals(1, andHits.length());
     assertSame("expected cached results!", andHits, search(cache, "and"));

     // Ten documents that match both "jack" and "and".
     writer = index.openIndexWriter(analyzer, false);
     for (int i = 0; i < 10; i++) {
       document = index.getDocumentFactory().newInstance();
       document.add(new Field(f, "All work and no play makes Jack a dull boy.",
           Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.NO));
       writer.addDocument(document);
     }
     writer.close();

     Hits jackHits = search(cache, "jack");
     assertEquals(10, jackHits.length());
     assertSame("expected cached results!", jackHits, search(cache, "jack"));

     assertNotSame(andHits, jackHits);

     // The new documents match "and", so that cached result must be gone.
     assertNotSame("expected invalidated results!", andHits, search(cache, "and"));
     andHits = search(cache, "and");
     assertEquals(11, andHits.length());

     // One more document that matches "and" but not "jack".
     writer = index.openIndexWriter(analyzer, false);
     document = index.getDocumentFactory().newInstance();
     document.add(new Field(f,
         "Hello Danny. Come and play with us. "
             + "Come and play with us, Danny. Forever... and ever... and ever.",
         Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.NO));
     writer.addDocument(document);
     writer.close();

     assertEquals(10, jackHits.length());
     assertSame("expected cached results!", jackHits, search(cache, "jack"));
     assertNotSame("expected invalidated results!", andHits, search(cache, "and"));

     andHits = search(cache, "and");
     assertEquals(12, andHits.length());

     // A document that matches neither cached query.
     writer = index.openIndexWriter(analyzer, false);
     document = index.getDocumentFactory().newInstance();
     document.add(new Field(f, "HERE'S JOHNNY",
         Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.NO));
     writer.addDocument(document);
     writer.close();

     assertEquals(1, search(cache, "johnny").length());

     assertSame("expected cached results!", jackHits, search(cache, "jack"));
     assertSame("expected cached results!", andHits, search(cache, "and"));

     assertEquals(10, jackHits.length());
     assertEquals(12, andHits.length());

     // Deleting that document must not touch the other cached results.
     IndexReader reader = index.openIndexReader();
     reader.deleteDocuments(new Term(f, "johnny"));
     reader.close();

     assertEquals(0, search(cache, "johnny").length());

     assertSame("expected cached results!", jackHits, search(cache, "jack"));
     assertSame("expected cached results!", andHits, search(cache, "and"));

     assertEquals(10, jackHits.length());
     assertEquals(12, andHits.length());
   }

}





---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Decorative cache (and Hits.setSearcher)

Posted by Chris Hostetter <ho...@fucit.org>.
: Perhaps I misunderstand you. The cache will remove the results from
: the cache if new documents matching the query are added to the index,
: or if it has been optimized. It is only on delete events that the
: cached results are retouched. So what you write does not really
: apply. Or does it?

you might be writing a cache that is that smart -- but the internal
caching done in the instances of the Hits object is not that smart --
that's what i'm talking about.

as I said in LUCENE-550...


> 3) i don't think the Hits.setSearcher method you added is safe ... i
> believe that at a minimum hitDocs, first, last, and weight all need to be
> reset -- weight's a tricky one since the instance doesn't currently hang
> on to the original query.

...those are all internal variables in the Hits class that would need to be
reinitialized like they are in the Hits constructor for a new
Hits.setSearcher method to be safe as far as i can tell.



-Hoss




Re: Decorative cache (and Hits.setSearcher)

Posted by karl wettin <ka...@gmail.com>.
On 18 Jan 2007, at 23:24, Chris Hostetter wrote:

> now you change the underlying Searcher/IndexReader out from under the
> Hits, replacing it with an updated index in which many new documents have
> been added that contain both "Lucene" and "java" ... if you issued a brand
> new search against this index document D may not score very high with this
> influx of new documents, but that Hits object still has that doc cached as
> the "best" doc with a good score ... only if you traverse enough of the
> results to force Hits to refetch more docs will it ever update that cache
> but odds are the client code isn't going to "recheck" that all the hits
> it's iterated over so far haven't changed -- it has no reason to even
> think it might need to do that since it doesn't know anything has
> changed -- as a result, iterating over the first 500 results from a Hits
> object might return D twice, with two different scores.

Perhaps I misunderstand you. The cache will remove the results from
the cache if new documents matching the query are added to the index,
or if it has been optimized. It is only on delete events that the
cached results are retouched. So what you write does not really
apply. Or does it?





Re: Decorative cache (and Hits.setSearcher)

Posted by Chris Hostetter <ho...@fucit.org>.
: Looked into this and came up with a perhaps better solution: adding
: yet another layer of decoration of the searcher passed to the Hits at
: construction time. This way the searcher can change without touching
: the Hits. I.e. the decorated searcher in the searcher passed to the
: Hits will be replaced, and the Hits communicate with the decorator.

i still don't think that will work -- Hits is maintaining cached
information about documents which is dependent on the contents of
the index accessible via the Searcher it was given when it was constructed
... it doesn't matter if the Searcher reference is really a decorator and
that decorator reference doesn't change, what matters is that the
underlying index is expected to be invariant.

Consider a very small index, where nearly every document matches "Java"
and only one document D that matches "Lucene" ... you get a Hits back on a
search for "Java Lucene" and you look at the score of the first document
-- which is D, the only document containing both words, and whose score is
largely a result of idf(Lucene) and the fact that it's the only document
whose coordFactor is "1"

now you change the underlying Searcher/IndexReader out from under the
Hits, replacing it with an updated index in which many new documents have
been added that contain both "Lucene" and "java" ... if you issued a brand
new search against this index document D may not score very high with this
influx of new documents, but that Hits object still has that doc cached as
the "best" doc with a good score ... only if you traverse enough of the
results to force Hits to refetch more docs will it ever update that cache
but odds are the client code isn't going to "recheck" that all the hits
it's iterated over so far haven't changed -- it has no reason to even
think it might need to do that since it doesn't know anything has
changed -- as a result, iterating over the first 500 results from a Hits
object might return D twice, with two different scores.
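
The scenario above can be reproduced with a toy model. The sketch below is not the real Hits class; LazyHits and StaleHitsDemo are hypothetical names, scores are left out, and the only thing modeled is that already cached ranks are kept while later ranks are fetched from whatever the searcher returns at that moment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.IntFunction;

/** Toy stand-in for a lazily filled result list; not the real Hits class. */
final class LazyHits {
    private final IntFunction<List<String>> searcher; // top-n document ids, best first
    private final List<String> fetched = new ArrayList<>();

    LazyHits(IntFunction<List<String>> searcher, int initialFetch) {
        this.searcher = searcher;
        fetchMore(initialFetch);
    }

    /** Returns the id at the given rank, fetching more results only when needed. */
    String id(int rank) {
        if (rank >= fetched.size()) {
            fetchMore((rank + 1) * 2);
        }
        return fetched.get(rank);
    }

    // Re-runs the search but keeps the ranks that are already cached.
    private void fetchMore(int n) {
        List<String> top = searcher.apply(n);
        for (int i = fetched.size(); i < top.size(); i++) {
            fetched.add(top.get(i));
        }
    }
}

final class StaleHitsDemo {
    public static void main(String[] args) {
        AtomicReference<List<String>> ranking =
            new AtomicReference<>(Arrays.asList("D", "a", "b", "c"));
        IntFunction<List<String>> searcher = n -> {
            List<String> all = ranking.get();
            return all.subList(0, Math.min(n, all.size()));
        };

        LazyHits hits = new LazyHits(searcher, 2);       // caches ranks 0..1: D, a
        ranking.set(Arrays.asList("a", "b", "D", "c"));  // index changes, D drops to rank 2

        // Rank 0 comes from the stale cache, rank 2 from the new ranking.
        System.out.println(hits.id(0) + " " + hits.id(2)); // prints "D D"
    }
}
```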

Frankly: even if you made it possible to change the Searcher out from
under a Hits instance, i can't figure out why you would ever want to.

-Hoss




Re: Decorative cache (and Hits.setSearcher)

Posted by karl wettin <ka...@gmail.com>.
Hoss Man [15/Jan/07 12:16 AM]
> On 14 Jan 2007, at 18:18, karl wettin wrote:
>>
>> I'm not certain I understand the possible effects of replacing the
>> searcher in an instance of Hits, but this is what I do in order to
>> keep cached instances valid when the index is updated with changes
>> that do not affect the results in the Hits. I would very much
>> like to hear from someone about that.
>
> i don't think the Hits.setSearcher method you added is safe ... i
> believe that at a minimum hitDocs, first, last, and weight all need
> to be reset -- weight's a tricky one since the instance doesn't
> currently hang on to the original query.

Looked into this and came up with a perhaps better solution: adding
yet another layer of decoration of the searcher passed to the Hits at
construction time. This way the searcher can change without touching
the Hits. I.e. the decorated searcher in the searcher passed to the
Hits will be replaced, and the Hits communicate with the decorator.
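
In outline, the extra layer of decoration could look something like the following. This is only a sketch of the pattern; ToySearcher and SwappableSearcher are hypothetical names and do not mirror the Lucene Searcher API.

```java
import java.util.List;

/** Hypothetical stand-in for the small part of a searcher that a Hits would call. */
interface ToySearcher {
    List<String> topDocs(int n);
}

/** Decorator whose delegate can be replaced without touching its clients. */
final class SwappableSearcher implements ToySearcher {
    private volatile ToySearcher delegate;

    SwappableSearcher(ToySearcher initial) {
        this.delegate = initial;
    }

    /** Points the decorator at a searcher over the updated index. */
    void setDelegate(ToySearcher replacement) {
        this.delegate = replacement;
    }

    @Override
    public List<String> topDocs(int n) {
        return delegate.topDocs(n);
    }
}
```

The client keeps one stable reference while the delegate behind it changes. Note that this only redirects future calls; anything the client has already fetched from the old delegate is not refreshed.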

I'll have to write a more complex test case to find out whether this
works or not.

The remaining problem is what to do with a Hits when one of the results
within it is removed from the index. Perhaps the easiest way
is to extend Hits and keep track of what has been removed, but I
would prefer to modify the actual hits. For now I just remove the Hits
from the cache on such events. With the hit collection cached it will
still rebuild much faster than re-running the query. But a "native"
Hits cache would be nice.


