Posted to java-user@lucene.apache.org by Trejkaz <tr...@trypticon.org> on 2012/11/23 05:10:24 UTC

Does anyone have tips on managing cached filters?

I recently implemented the ability for multiple users to open the
index in the same process. ("Whoa", you might think, but this has been
a single-user application forever and we're only just making the
platform capable of supporting more than that.)

I found that filters were being stored twice, and since it's basically
the same filter each time and filters can be pretty large, I set out
to do something about that.

Problem is, I can't figure out when to invalidate the things.

With a single user it was easy. If the user tags an item, you
invalidate the TagFilter for that tag and of course the AnyTagFilter.
(Yes, we could instead update the filtered bitset immediately. That's
an improvement for another day.)
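
For context, the single-user invalidation looks roughly like the sketch
below. This is purely illustrative: TagFilter and AnyTagFilter are our own
application classes, and the cache shape shown here is a simplification,
not the actual code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.lucene.search.Filter;

class TagFilterCache {
    private final Map<String, Filter> cachedTagFilters =
            new ConcurrentHashMap<String, Filter>();
    private volatile Filter cachedAnyTagFilter;

    // Called when a user tags (or untags) an item: drop the stale cached
    // filters so they are rebuilt lazily on the next search.
    void onTagChanged(String tagName) {
        cachedTagFilters.remove(tagName);
        cachedAnyTagFilter = null;
    }
}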

With multiple users I have two additional scenarios which I can't find
a single consistent solution to:

* Two users, each opening different indices. If the first user tags
something, it should only invalidate the filters for their readers and
not the other users'.

* Two users, opening the same index but one is looking at a newer
copy. So they might share some segments, but not all the segments. If
the first user tags something, it should invalidate all the filters
for that index, whether the first user has them open or not, otherwise
the second user will see out of date information.

The obvious trivial solutions each satisfy exactly one of the above, but not both:

1. When invalidating, walk the tree of index readers the user has open
and invalidate any filter cached for those readers. Suits the first
scenario but not the second.

2. Just invalidate every doc ID set for every reader. Suits the second
scenario but not the first, but at least it is technically correct: it
won't give bad results, just bad performance. So it's the better of
the two at the moment, and probably still better than keeping the same
filter bit set in memory twice.

As for actually doing the invalidation, CachingWrapperFilter itself
doesn't appear to have any mechanism for invalidation at all, so I
imagine I will be building a variation of it with additional methods
to invalidate parts of the cache.
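
A rough sketch of the sort of variation being described follows. The class
and method names are invented, and it assumes the Lucene 3.x Filter API
where getDocIdSet takes an IndexReader (in 4.0 the signature takes an
AtomicReaderContext instead), so treat it as a shape, not an implementation.

import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;

/** Like CachingWrapperFilter, but with explicit invalidation hooks (sketch only). */
class InvalidatableCachingFilter extends Filter {
    private final Filter delegate;
    private final Map<IndexReader, DocIdSet> cache =
            Collections.synchronizedMap(new WeakHashMap<IndexReader, DocIdSet>());

    InvalidatableCachingFilter(Filter delegate) {
        this.delegate = delegate;
    }

    @Override
    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        DocIdSet cached = cache.get(reader);
        if (cached == null) {
            cached = delegate.getDocIdSet(reader);
            cache.put(reader, cached);
        }
        return cached;
    }

    /** Option 1: drop only the entries for the readers one user has open. */
    void invalidate(IndexReader reader) {
        cache.remove(reader);
    }

    /** Option 2: drop everything; always correct, just slower to repopulate. */
    void invalidateAll() {
        cache.clear();
    }
}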

TX



Re: Does anyone have tips on managing cached filters?

Posted by Trejkaz <tr...@trypticon.org>.
On Thu, Nov 29, 2012 at 4:57 PM, Trejkaz <tr...@trypticon.org> wrote:
> doubt we're not

Rats. Accidentally double-negatived that. I doubt we are the only ones. *

TX



Re: Does anyone have tips on managing cached filters?

Posted by Arjen van der Meijden <ac...@tweakers.net>.
We have something similar with documents that can be tagged (and have
many other relations). But for the purposes of search we have two
differences from your approach:
- We do actually index the relation's id (i.e. the tag's id) as part of
the Lucene document and update the document if that relation between the
item and a tag is changed. So a filter on some 'tag' becomes a trivial
termsFilter.addTerm('tagId', '12345') - a small sketch of this follows below.
- We use Lucene only as the base of the results we're going to send back
to the user, i.e. we get results from Lucene and then do some more
processing on them.
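
A minimal sketch of that first point, with an invented field name and tag
value. It assumes the TermsFilter from the contrib "queries" module in the
3.x line; the package is different in later versions.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.TermsFilter;  // contrib "queries" in 3.x; org.apache.lucene.queries.TermsFilter in 4.x

class TagFilters {
    /** Filter matching documents whose indexed "tagId" field carries the given tag. */
    static Filter forTag(String tagId) {
        TermsFilter filter = new TermsFilter();
        filter.addTerm(new Term("tagId", tagId));
        return filter;
    }
}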

But that last difference is actually because we started with an
in-memory "database" application that did basically what Lucene already
does, just with more complicated objects, more complicated
facet extraction, more complicated filters, etc. So Lucene is only used
when we need keyword filtering, and we help Lucene do that quickly by
offering some Filters derived from the rest of the application's work.
And yes, if we were to redesign the application, it could turn out
differently :P

Best regards,

Arjen

On 29-11-2012 6:57 Trejkaz wrote:
> On Wed, Nov 28, 2012 at 6:28 PM, Robert Muir <rc...@gmail.com> wrote:
>> My point is really that lucene (especially clear in 4.0) assumes
>> indexreaders are immutable points in time. I don't think it makes sense for
>> us to provide any e.g. filtercaching or similar otherwise, because this is
>> a key simplification to the design. If you depart from this, by scoring or
>> filtering from mutable stuff outside the inverted index, things are likely
>> going to get complicated.
>
> Whereas it would be lovely to live in a land of rainbows and unicorns
> where all the data you ever want to use is in the text index and all
> filters can be written as a query, that simply isn't the case for us
> and I very much doubt we're not the only ones in this situation.
>
> Sure, things are complicated. Anything except the most trivial forum
> search application is complicated.
>
> Well, the situation as it stands now is that when a filter is
> invalidated, it happens across all stores which are currently open.
> That means that results are at least correct, but after invalidating a
> filter, a little more work than necessary is required to populate the
> cache again. For certain filters (like word lists) this is necessary
> anyway, since adding a word might invalidate any store. For others
> like tags, I was hoping there would be some way to selectively
> invalidate only certain readers. But it seems like that isn't the
> case, so I will probably have to add a third level of caching to cache
> these sorts of filter per-store instead of globally.
>
> TX



Re: Does anyone have tips on managing cached filters?

Posted by Trejkaz <tr...@trypticon.org>.
On Wed, Nov 28, 2012 at 6:28 PM, Robert Muir <rc...@gmail.com> wrote:
> My point is really that lucene (especially clear in 4.0) assumes
> indexreaders are immutable points in time. I don't think it makes sense for
> us to provide any e.g. filtercaching or similar otherwise, because this is
> a key simplification to the design. If you depart from this, by scoring or
> filtering from mutable stuff outside the inverted index, things are likely
> going to get complicated.

Whereas it would be lovely to live in a land of rainbows and unicorns
where all the data you ever want to use is in the text index and all
filters can be written as a query, that simply isn't the case for us
and I very much doubt we're not the only ones in this situation.

Sure, things are complicated. Anything except the most trivial forum
search application is complicated.

Well, the situation as it stands now is that when a filter is
invalidated, it happens across all stores which are currently open.
That means that results are at least correct, but after invalidating a
filter, a little more work than necessary is required to populate the
cache again. For certain filters (like word lists) this is necessary
anyway, since adding a word might invalidate any store. For others
like tags, I was hoping there would be some way to selectively
invalidate only certain readers. But it seems like that isn't the
case, so I will probably have to add a third level of caching to cache
these sorts of filter per-store instead of globally.
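
Something like the following per-store layer is the general idea. This is
purely a sketch: the store key, cache shape, and method names are all
invented for illustration.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.apache.lucene.search.Filter;

/** Sketch of the extra per-store level: tagging in one store invalidates only that store's filters. */
class PerStoreFilterCache {
    // store key -> (filter key -> cached filter)
    private final ConcurrentMap<String, ConcurrentMap<String, Filter>> cachesByStore =
            new ConcurrentHashMap<String, ConcurrentMap<String, Filter>>();

    ConcurrentMap<String, Filter> cacheFor(String storeKey) {
        ConcurrentMap<String, Filter> cache = cachesByStore.get(storeKey);
        if (cache == null) {
            cache = new ConcurrentHashMap<String, Filter>();
            ConcurrentMap<String, Filter> existing = cachesByStore.putIfAbsent(storeKey, cache);
            if (existing != null) {
                cache = existing;
            }
        }
        return cache;
    }

    /** Called when, e.g., a tag changes in one store; other stores keep their cached filters. */
    void invalidateStore(String storeKey) {
        cachesByStore.remove(storeKey);
    }
}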

TX



Re: Does anyone have tips on managing cached filters?

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Nov 28, 2012 at 12:27 AM, Trejkaz <tr...@trypticon.org> wrote:

> On Wed, Nov 28, 2012 at 2:09 AM, Robert Muir <rc...@gmail.com> wrote:
> >
> > I don't understand how a filter could become invalid even though the
> reader
> > has not changed.
>
> I did state two ways in my last email, but just to re-iterate:
>
> (1): The filter reflects a query constructed from lines in a text
> file. If some other application modifies the text file, that filter is
> now invalid.
>
> (2): The filter reflects the results of an SQL query against a
> separate database. If someone inserts a new value into that table,
> then that filter is now invalid.
>
> Case 1 occurs for things like word lists. Case 2 occurs for things
> like tags. Neither of these would ever be possible to implement purely
> using Lucene, so it is a fact of life that they will become invalid
> for reasons other than the reader changing.
>
>
My point is really that lucene (especially clear in 4.0) assumes
indexreaders are immutable points in time. I don't think it makes sense for
us to provide any e.g. filtercaching or similar otherwise, because this is
a key simplification to the design. If you depart from this, by scoring or
filtering from mutable stuff outside the inverted index, things are likely
going to get complicated.

Re: Does anyone have tips on managing cached filters?

Posted by Trejkaz <tr...@trypticon.org>.
On Wed, Nov 28, 2012 at 2:09 AM, Robert Muir <rc...@gmail.com> wrote:
>
> I don't understand how a filter could become invalid even though the reader
> has not changed.

I did state two ways in my last email, but just to re-iterate:

(1): The filter reflects a query constructed from lines in a text
file. If some other application modifies the text file, that filter is
now invalid.

(2): The filter reflects the results of an SQL query against a
separate database. If someone inserts a new value into that table,
then that filter is now invalid.

Case 1 occurs for things like word lists. Case 2 occurs for things
like tags. Neither of these would ever be possible to implement purely
using Lucene, so it is a fact of life that they will become invalid
for reasons other than the reader changing.
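
To make case 1 concrete, a filter built from a word list might look roughly
like the sketch below. The field name and file handling are illustrative
only, and it leans on the contrib TermsFilter mentioned elsewhere in this
thread; such a filter goes stale whenever the underlying file changes.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.TermsFilter;  // contrib "queries" module in 3.x

class WordListFilters {
    /** Builds a filter from the lines of a word-list file; invalid once the file is edited. */
    static Filter fromWordList(File wordList, String field) throws IOException {
        TermsFilter filter = new TermsFilter();
        BufferedReader reader = new BufferedReader(new FileReader(wordList));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (word.length() > 0) {
                    filter.addTerm(new Term(field, word));
                }
            }
        } finally {
            reader.close();
        }
        return filter;
    }
}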

TX



Re: Does anyone have tips on managing cached filters?

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Nov 27, 2012 at 6:17 AM, Trejkaz <tr...@trypticon.org> wrote:

>
> Ah, yeah... I should have been clearer on what I meant there.
>
> If you want to make a filter which relies on data that isn't in the
> index, there is no mechanism for invalidation. One example of it is if
> you have a filter which essentially constructs a query based on the
> contents of a text file (like a word list.) Another example is with
> tagging, with the tags stored in an external database.
>

I don't understand how a filter could become invalid even though the reader
has not changed.

If this is the case in your design, then you have much bigger problems.

Re: Does anyone have tips on managing cached filters?

Posted by Trejkaz <tr...@trypticon.org>.
On Tue, Nov 27, 2012 at 9:31 AM, Robert Muir <rc...@gmail.com> wrote:
> On Thu, Nov 22, 2012 at 11:10 PM, Trejkaz <tr...@trypticon.org> wrote:
>
>>
>> As for actually doing the invalidation, CachingWrapperFilter itself
>> doesn't appear to have any mechanism for invalidation at all, so I
>> imagine I will be building a variation of it with additional methods
>> to invalidate parts of the cache.
>>
>>
> Actually it does, it uses a weakhashmap keyed on either the segment
> (core+deletes) or just the segment's core.

Ah, yeah... I should have been clearer on what I meant there.

If you want to make a filter which relies on data that isn't in the
index, there is no mechanism for invalidation. One example is a filter
which essentially constructs a query based on the contents of a text
file (like a word list). Another example is tagging, with the tags
stored in an external database.

At the moment we use a separate level of filter cache which asks the
contained filter whether it's still OK to use (if the timestamp on the
file changes, it gets ejected from the cache). I suspect that cache is
useful anyway, as it also holds onto the filter instances so that they
don't get collected too soon. (Filters can come out of our query
parser, so the caller can't conveniently hold onto the instances in all
cases, and sometimes they run two similar queries which happen to use
the same filter, so caching the entire resulting query doesn't help
either.)
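
Very roughly, that extra cache level has the shape sketched below. The
interface and method names are made up for illustration, not taken from
the actual code; the point is only the "ask the filter whether it is still
valid before reusing it" step.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.lucene.search.Filter;

/** A filter that knows when its backing data (e.g. a word-list file) has changed. */
interface Expirable {
    /** Still usable? For a file-backed filter this might compare the file's lastModified time. */
    boolean isStillValid();
}

class ExpiringFilterCache {
    private final Map<String, Filter> cache = new ConcurrentHashMap<String, Filter>();

    Filter get(String key) {
        Filter cached = cache.get(key);
        if (cached instanceof Expirable && !((Expirable) cached).isStillValid()) {
            cache.remove(key);   // eject the stale entry; the caller rebuilds and re-puts it
            return null;
        }
        return cached;
    }

    void put(String key, Filter filter) {
        cache.put(key, filter);
    }
}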

An interesting, somewhat-related issue is that for some filters, we
can't keep the contents of the file itself in memory due to size
limits, so we have to read it on the fly. When there are multiple
segments, the file gets read multiple times. So it's a rare case where
computing the filter across all readers might actually come out faster
than computing it per-segment...

TX



Re: Does anyone have tips on managing cached filters?

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Nov 22, 2012 at 11:10 PM, Trejkaz <tr...@trypticon.org> wrote:

>
> As for actually doing the invalidation, CachingWrapperFilter itself
> doesn't appear to have any mechanism for invalidation at all, so I
> imagine I will be building a variation of it with additional methods
> to invalidate parts of the cache.
>
>
Actually it does: it uses a WeakHashMap keyed on either the segment
(core+deletes) or just the segment's core.

The former (recacheDeletes=true) "bakes" the deletes into the bitset, but
at the expense of being invalidated much more often.
The latter (recacheDeletes=false) intersects the deletes at search time, so
it's slightly slower but stays cached even as documents are deleted from
that segment.
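
In code, assuming the Lucene 4.0-era constructor with the recacheDeletes
flag described above (the wrapped filter here is just a placeholder):

import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;

class FilterCachingExamples {
    static Filter cachePerCore(Filter f) {
        // recacheDeletes=false (the default): deletes are intersected at search
        // time, so the cached bitset survives as documents are deleted in the segment.
        return new CachingWrapperFilter(f);
    }

    static Filter cacheWithDeletesBaked(Filter f) {
        // recacheDeletes=true: deletes are baked into the bitset, slightly faster
        // to consume but the entry is invalidated whenever the deletes change.
        return new CachingWrapperFilter(f, true);
    }
}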