Posted to solr-user@lucene.apache.org by karl wettin <ka...@gmail.com> on 2006/08/04 08:33:49 UTC

a thought on cache

I don't do Solr, but had this thought that might be interesting: instead
of associating the cache with an IndexSearcher, it could stand by itself.
When new documents are inserted (if I understand it right, Solr has
some kind of notification system for this) the cached queries are run
against the new documents (indexed in a Memory- or InstantiatedIndex [Lucene
issue 550]) to see if they affect the cached results. If not, the cache is
kept. If so, the cache is rebuilt or removed. With pre-tokenized fields (Lucene
issue 580) it would not consume that many resources at all, but perhaps
that will not fit in the Solr scheme.

Any immediate comments on that? I'd like to implement something like
this for myself, as I notice the CPU working a bit harder than I want it
to every time I update an index.

Re: a thought on cache

Posted by karl wettin <ka...@gmail.com>.

On Thu, 2006-08-03 at 23:53 -0700, Chris Hostetter wrote:

>   1) as new docs come in, add them to a purely in memory index
>   2) when it becomes time to "commit" the new documents, test all queries
>      in the cache against this in memory index.
>   3) any query in the cache which has a hit on this in memory index should
>      be invalidated, any query which does not have a hit is still valid.

You got it.

> ...this could probably work if the index was purely additive 

> check if one of the cached queries matched on the deleted document

Hmm, didn't see that one coming. Quick and dirty would be to rebuild
the document from the original source. I'll have to think of a better
solution than that though.

> the next segment merge could collapse doc ids above deleted docs which
> were totally unrelated to any docs that were added or deleted -- so
> you would think they are still valid even though the doc ids in the
> cache don't correspond to the same documents anymore.

This is not the first time I have thought of low-level hooks in the index.
If an optimization could report its changes, this would not be a problem,
would it?

> while the "old" IndexSearcher is still being used by external requests
> (and still using its cache) a new "on deck" IndexSearcher is opened,
> and an internal thread is running queries against it (the results of

I do something similar to that. But all those queries (in some cases
tens of thousands, against a frequently updated index) hog more CPU than I
think they have to. I'm low on CPU (spent on real-time collaborative
filtering etc.) but have a more or less unlimited amount of RAM.

Re: a thought on cache

Posted by Chris Hostetter <ho...@fucit.org>.

: I don't do Solr, but had this thought that might be interesting: instead
: of associating the cache with an IndexSearcher, it could stand by itself.
: When new documents are inserted (if I understand it right, Solr has
: some kind of notification system for this) the cached queries are run
: against the new documents (indexed in a Memory- or InstantiatedIndex [Lucene
: issue 550]) to see if they affect the cached results. If not, the cache is
: kept. If so, the cache is rebuilt or removed. With pre-tokenized fields (Lucene
: issue 580) it would not consume that many resources at all, but perhaps
: that will not fit in the Solr scheme.

I may be misunderstanding your idea, so let me reword it the way I
understand it and you tell me if I'm missing something...

  1) as new docs come in, add them to a purely in memory index
  2) when it becomes time to "commit" the new documents, test all queries
     in the cache against this in memory index.
  3) any query in the cache which has a hit on this in memory index should
     be invalidated, any query which does not have a hit is still valid.
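
The three steps above can be sketched roughly as follows. This is only a toy illustration, not Solr or Lucene code: a "query" is assumed to be a plain predicate over one document, and the names term_query and surviving_entries are hypothetical. It covers the purely additive case only.

```python
from typing import Callable, Dict, List, Set

Doc = Dict[str, str]
Query = Callable[[Doc], bool]   # assumption: a query is a predicate over one document


def term_query(field: str, term: str) -> Query:
    """Build a toy query that matches documents containing `term` in `field`."""
    return lambda doc: term in doc.get(field, "").lower().split()


def surviving_entries(cache: Dict[str, Set[int]],
                      queries: Dict[str, Query],
                      pending_docs: List[Doc]) -> Dict[str, Set[int]]:
    """Steps 2 and 3: keep only cache entries whose query hits no pending document."""
    kept = {}
    for key, doc_ids in cache.items():
        if any(queries[key](doc) for doc in pending_docs):
            continue          # the query hits a new document: invalidate
        kept[key] = doc_ids   # no hit: the cached result is still valid
    return kept


# Step 1: new documents are buffered in memory until commit.
pending = [{"body": "solr cache warming"}]

queries = {"body:cache": term_query("body", "cache"),
           "body:lucene": term_query("body", "lucene")}
cache = {"body:cache": {0, 3}, "body:lucene": {1, 2}}

cache = surviving_entries(cache, queries, pending)
print(sorted(cache))   # only the entry untouched by the new document remains
```

The cost per commit is one pass of every cached query over the small in-memory batch, rather than over the whole index.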

...this could probably work if the index was purely additive (ie: only
ever grew over time) but I don't think it's feasible in an index in which
deletes are executed ... not only would you need to check if one of the
cached queries matched on the deleted document, but the next segment merge
could collapse doc ids above deleted docs which were totally unrelated to
any docs that were added or deleted -- so you would think they are still
valid even though the doc ids in the cache don't correspond to the same
documents anymore.
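
A toy illustration of the doc id problem, assuming a doc id is simply a position in a list and a merge just drops deleted slots (a simplification, not how Lucene stores segments; the merge function and the data are hypothetical):

```python
from typing import List, Optional, Set


def merge(segment: List[Optional[str]]) -> List[str]:
    """Toy segment merge: drop deleted slots (None) so later doc ids shift down."""
    return [doc for doc in segment if doc is not None]


# Doc id is the list position.
index: List[Optional[str]] = ["apple", "banana", "cherry", "date"]

# Cached result for a query that matches only "date" (doc id 3).
cached_ids: Set[int] = {3}

index[1] = None          # delete "banana"; the cached query never matched it
merged = merge(index)    # "date" now lives at doc id 2

# A cached id is stale if it is out of range or points at a different document.
stale = [doc_id for doc_id in cached_ids
         if doc_id >= len(merged) or merged[doc_id] != "date"]
print(stale)
```

The cached query matched neither an added nor a deleted document, yet its cached doc id no longer points at the right document after the merge.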

: Any immediate comments on that? I'd like to implement something like
: this for myself, as I notice the CPU working a bit harder than I want it
: to every time I update an index.

Solr reduces this impact by letting you configure "cache warming" when
changes are committed. The gist of it is that while the "old" IndexSearcher
is still being used by external requests (and still using its cache) a
new "on deck" IndexSearcher is opened, and an internal thread runs
queries against it (the results of which are cached) for all of the
"best" items in the previous cache. Once a certain number of cache
entries have been seeded, the "on deck" IndexSearcher is swapped in and
used for all future queries.
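
The warming-and-swap idea can be sketched like this. It is a simplified model, not Solr's implementation: the Searcher class and warm function are hypothetical, and "best" is assumed here to mean "most recently used".

```python
from collections import OrderedDict
from typing import List, Set


class Searcher:
    """Toy stand-in for an IndexSearcher with its own query cache."""

    def __init__(self, docs: List[str]):
        self.docs = docs
        self.cache: "OrderedDict[str, Set[int]]" = OrderedDict()

    def search(self, term: str) -> Set[int]:
        if term in self.cache:
            self.cache.move_to_end(term)   # most recently used goes last
            return self.cache[term]
        hits = {i for i, doc in enumerate(self.docs) if term in doc.split()}
        self.cache[term] = hits
        return hits


def warm(old: Searcher, on_deck: Searcher, max_entries: int) -> None:
    """Re-run the most recently used queries from the old cache on the new searcher."""
    recent_first = list(reversed(old.cache))
    for term in recent_first[:max_entries]:
        on_deck.search(term)


live = Searcher(["solr cache", "lucene index"])
for term in ["solr", "lucene", "cache"]:
    live.search(term)

on_deck = Searcher(live.docs + ["solr index"])   # reflects the committed changes
warm(live, on_deck, max_entries=2)               # live keeps serving meanwhile
live = on_deck                                   # swap once warming is done
print(list(live.cache))
```

Queries that were not warmed (here "solr") are simply computed against the new searcher on first use.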

You can even configure custom actions to take place on commit or optimize
(using "listeners") if you want different prepopulation of your caches
each time.  I, for example, wrote a warming plugin that crawls the metadata
in my index and caches all sorts of Filters for it, up to a configurable
amount of time, at which point it gives up -- I have it configured to be
used on server start up, aka "firstSearcher".
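
A time-bounded warmer along those lines might look roughly like the sketch below. It is not the plugin described above: warm_filters, category_filter, the injectable clock and the sample data are all hypothetical, and a "filter" is assumed to be a set of matching doc ids.

```python
import time
from typing import Callable, Dict, Iterable, Set


def warm_filters(values: Iterable[str],
                 build_filter: Callable[[str], Set[int]],
                 budget_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> Dict[str, Set[int]]:
    """Cache one filter per metadata value until the time budget runs out."""
    deadline = clock() + budget_seconds
    filters: Dict[str, Set[int]] = {}
    for value in values:
        if clock() >= deadline:
            break                      # out of time: give up, keep what we have
        filters[value] = build_filter(value)
    return filters


docs = [{"category": "books"}, {"category": "music"}, {"category": "books"}]
categories = sorted({doc["category"] for doc in docs})   # "crawl" the metadata


def category_filter(value: str) -> Set[int]:
    """Toy filter: the set of doc ids whose category equals `value`."""
    return {i for i, doc in enumerate(docs) if doc["category"] == value}


filters = warm_filters(categories, category_filter, budget_seconds=5.0)
print(filters)
```

Passing the clock in as a parameter is only a convenience so the give-up behaviour can be exercised without actually waiting.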



-Hoss

Re: a thought on cache

Posted by karl wettin <ka...@gmail.com>.

On Fri, 2006-08-04 at 11:18 -0400, Yonik Seeley wrote:
> On 8/4/06, karl wettin <ka...@gmail.com> wrote:
> > When new documents are inserted (if I understand it right, Solr has
> > some kind of notification system for this) the cached queries are run
> > against the new documents (indexed in a Memory- or InstantiatedIndex [Lucene
> > issue 550]) to see if they affect the cached results.
> 
> It would be complicated enough for a filter cache (just the docs that
> match), but doesn't even seem possible for a query cache where
> relevancy scores could change due to changes in idf.  Perhaps doable
> if one were willing to drop all idf terms from scoring...

Ouch. Yes, this is a hard nut to crack. I'll most definitely sleep on it
for a couple of nights though.

Thanks all for the input!

Re: a thought on cache

Posted by Yonik Seeley <yo...@apache.org>.

On 8/4/06, karl wettin <ka...@gmail.com> wrote:
> When new documents are inserted (if I understand it right, Solr has
> some kind of notification system for this) the cached queries are run
> against the new documents (indexed in a Memory- or InstantiatedIndex [Lucene
> issue 550]) to see if they affect the cached results.

It would be complicated enough for a filter cache (just the docs that
match), but doesn't even seem possible for a query cache where
relevancy scores could change due to changes in idf.  Perhaps doable
if one were willing to drop all idf terms from scoring...
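
To make the idf point concrete, here is a small sketch. It assumes the classic TF-IDF style formula idf = 1 + ln(numDocs / (docFreq + 1)); the index sizes are made up for illustration.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Assumed classic form: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Index of 10 documents, 2 of which contain the query term.
before = idf(num_docs=10, doc_freq=2)

# One new document containing the term is added.
with_term = idf(num_docs=11, doc_freq=3)

# One new document NOT containing the term is added instead.
without_term = idf(num_docs=11, doc_freq=2)

print(round(before, 3), round(with_term, 3), round(without_term, 3))
```

Either way the idf moves, so the scores of documents already in a cached result change even when the set of matching documents stays the same.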

-Yonik