Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2009/06/02 16:39:06 UTC

Question on CachingWrapperFilter

Hi

I read CWF today and initially I thought this is going to cache a Filter
in-memory for me, so that I can more efficiently use it for subsequent
searches. But I learned that all it does is cache the DocIdSet returned by
the wrapped Filter.

This is good in and of itself, but I wonder if we shouldn't go the extra
mile and wrap stuff in memory for Filters which don't operate from memory.
For example - I have a Filter which reads information from a Payload as it's
iterated on, so it doesn't keep anything in memory (it's per-user
information, so I haven't decided yet if I can afford caching it in-memory
and whether it will be beneficial). Caching that sort of Filter by CWF will
obviously not improve anything.

I'm not sure what to do here:
1. Just reflect that in the javadoc (it is very confusing saying "Wraps
another filter's result and caches it", which is not true)
2. Introduce a class which takes a Filter and loads it into memory (I think
I read an issue/discussion about this), to an OpenBitSet for example (but we
need to know the number of results in advance, or grow the array as we go
along).
3. Don't use CWF, write a "load-a-Filter-into-in-memory-Filter" utility, and
cache the Filters w/ the user as Key.
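
A minimal sketch of the per-user cache from option 3, assuming each user's filter has already been loaded into a bit set. The class name PerUserBitSetCache, the loader function, and the LRU eviction policy are hypothetical choices for illustration and are not part of Lucene. It uses java.util.BitSet, which grows as bits are set, so nothing has to be known about the number of results in advance.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not a Lucene class: keeps one in-memory bit set per user.
final class PerUserBitSetCache {
    private final int maxUsers;
    private final Function<String, BitSet> loader;
    private final Map<String, BitSet> cache;

    PerUserBitSetCache(int maxUsers, Function<String, BitSet> loader) {
        this.maxUsers = maxUsers;
        this.loader = loader;
        // accessOrder = true makes the map iterate least-recently-used first,
        // so removeEldestEntry evicts the user that was touched longest ago.
        this.cache = new LinkedHashMap<String, BitSet>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, BitSet> eldest) {
                return size() > PerUserBitSetCache.this.maxUsers;
            }
        };
    }

    // Returns the cached bit set for the user, loading it on a miss.
    synchronized BitSet get(String user) {
        BitSet bits = cache.get(user);
        if (bits == null) {
            bits = loader.apply(user); // the expensive step, e.g. a payload scan
            cache.put(user, bits);
        }
        return bits;
    }

    synchronized int size() {
        return cache.size();
    }
}
```

Whether the memory cost is acceptable depends on the index size and the number of active users; maxUsers is the knob for that trade-off.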

I will probably need to do the second part of (3) anyway, so I'm asking
whether such a utility is useful to exist in Lucene, and perhaps there's
already one (I thought I read somewhere about the ability to execute a Query
and get back a Filter, or use the results as a Filter)? I looked at
QueryWrapperFilter, but it doesn't seem to give me what I need, since its
getDocIdSet method returns an iterator which is the Scorer of the Query that
it wraps.

Anyway, I think the documentation of CWF should be fixed and made clearer.

Any thoughts?

Shai

Re: Question on CachingWrapperFilter

Posted by Michael McCandless <lu...@mikemccandless.com>.
I think, once we can efficiently apply cheap random-access docIDSets
the way deleted docs are applied (ie, distribute down to all
SegmentTermDocs) then it'd be useful for this filter manager to also
pre-fold deletes in, such that SegmentTermDocs would only have a
single random-access docIDSet to check.
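
As a rough illustration of pre-folding deletes, here is a sketch that uses java.util.BitSet as a stand-in for a random-access doc ID set. The names DeleteFolding and foldDeletes are hypothetical, and nothing here reflects how SegmentTermDocs actually consults deletions.

```java
import java.util.BitSet;

// Hypothetical helper, not a Lucene class.
final class DeleteFolding {
    // Returns a new bit set holding the docs that pass the filter and are not deleted.
    // The inputs are left untouched so a cached filter can be re-folded after new deletes.
    static BitSet foldDeletes(BitSet filterBits, BitSet deletedDocs) {
        BitSet live = (BitSet) filterBits.clone();
        live.andNot(deletedDocs); // clear every bit that is set in deletedDocs
        return live;
    }
}
```

With the deletes folded in, a consumer only has to test one bit per document instead of checking the filter and the deletions separately.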

Mike

On Wed, Jun 3, 2009 at 4:03 AM, Shai Erera<se...@gmail.com> wrote:
> Thanks Paul !
>
> I'll work on such a utility (which takes a Filter and reads it into an
> OpenBitSet or SortedVIntList) and then post back in case you're interested
> in adopting it and changing CWF to use it, or something else.
>
> Shai
>
> On Tue, Jun 2, 2009 at 9:35 PM, Paul Elschot <pa...@xs4all.nl> wrote:
>>
>> On Tuesday 02 June 2009 16:39:06 Shai Erera wrote:
>> > Hi
>> >
>> > I read CWF today and initially I thought this is going to cache a Filter
>> > in-memory for me, so that I can more efficiently use it for subsequent
>> > searches. But I learned that all it does is cache the DocIdSet returned
>> > by
>> > the wrapped Filter.
>> >
>> > This is good in and of itself, but I wonder if we shouldn't go the extra
>> > mile and wrap stuff in memory for Filters which don't operate from
>> > memory.
>>
>> It was good until QueryWrapperFilter returned a Scorer instead of a disi
>> based on an (Open)BitSet.
>>
>> > For example - I have a Filter which reads information from a Payload as
>> > it's
>> > iterated on, so it doesn't keep anything in memory (it's per-user
>> > information, so I haven't decided yet if I can afford caching it
>> > in-memory
>> > and whether it will be beneficial). Caching that sort of Filter by CWF
>> > will
>> > obviously not improve anything.
>> >
>> > I'm not sure what to do here:
>> > 1. Just reflect that in the javadoc (it is very confusing saying "Wraps
>> > another filter's result and caches it", which is not true)
>> > 2. Introduce a class which takes a Filter and loads it into memory (I
>> > think
>> > I read an issue/discussion about this), to an OpenBitSet for example
>> > (but we
>> > need to know the number of results in advance, or grow the array as we
>> > go
>> > along).
>> > 3. Don't use CWF, write a "load-a-Filter-into-in-memory-Filter" utility,
>> > and
>> > cache the Filters w/ the user as Key.
>>
>> For that, one could subclass CWF and override the docIdSetToCache method
>> to return an OpenBitSetDISI constructed from the given disi.
>>
>> > I will probably need to do the second part of (3) anyway, so I'm asking
>> > whether such a utility is useful to exist in Lucene, and perhaps there's
>> > already one (I thought I read somewhere about the ability to execute a
>> > Query
>> > and get back a Filter, or use the results as a Filter)?
>>
>> That is what QueryWrapperFilter does.
>>
>> > I looked at
>> > QueryWrapperFilter, but it doesn't seem to give me what I need, since
>> > its
>> > getDocIdSet method returns an iterator which is the Scorer of the Query
>> > that
>> > it wraps.
>>
>> The Scorer seems to be what you need, but there are cheaper disis, see
>> below.
>>
>> >
>> > Anyway, I think the documentation of CWF should be fixed and made
>> > clearer.
>> >
>> > Any thoughts?
>>
>> The basic problem is that disis from DocIdSets come in two variations:
>> expensive
>> ones e.g. based on a query, and cheap ones based e.g. on an OpenBitSet or
>> on
>> a SortedVIntList.
>> One would normally want to cache a DocIdSet that provides a cheap disi.
>>
>> For the javadocs of the current CWF it could be sufficient to mention more
>> prominently that the default CWF caches the given DocIdSet, basically
>> assuming that its disi is cheap.
>>
>> But it might be a good idea to change the default implementation to check
>> whether the given DocIdSet is an OpenBitSet, and use that to be cached in
>> that case, and otherwise provide an OpenBitSetDISI.
>>
>> Regards,
>> Paul Elschot
>>
>
>


Re: Question on CachingWrapperFilter

Posted by Shai Erera <se...@gmail.com>.
Thanks Paul !

I'll work on such a utility (which takes a Filter and reads it into an
OpenBitSet or SortedVIntList) and then post back in case you're interested
in adopting it and changing CWF to use it, or something else.

Shai

On Tue, Jun 2, 2009 at 9:35 PM, Paul Elschot <pa...@xs4all.nl> wrote:

> On Tuesday 02 June 2009 16:39:06 Shai Erera wrote:
> > Hi
> >
> > I read CWF today and initially I thought this is going to cache a Filter
> > in-memory for me, so that I can more efficiently use it for subsequent
> > searches. But I learned that all it does is cache the DocIdSet returned
> by
> > the wrapped Filter.
> >
> > This is good in and of itself, but I wonder if we shouldn't go the extra
> > mile and wrap stuff in memory for Filters which don't operate from
> memory.
>
>
> It was good until QueryWrapperFilter returned a Scorer instead of a disi
> based on an (Open)BitSet.
>
>
> > For example - I have a Filter which reads information from a Payload as
> it's
> > iterated on, so it doesn't keep anything in memory (it's per-user
> > information, so I haven't decided yet if I can afford caching it
> in-memory
> > and whether it will be beneficial). Caching that sort of Filter by CWF
> will
> > obviously not improve anything.
> >
> > I'm not sure what to do here:
> > 1. Just reflect that in the javadoc (it is very confusing saying "Wraps
> > another filter's result and caches it", which is not true)
> > 2. Introduce a class which takes a Filter and loads it into memory (I
> think
> > I read an issue/discussion about this), to an OpenBitSet for example (but
> we
> > need to know the number of results in advance, or grow the array as we go
> > along).
> > 3. Don't use CWF, write a "load-a-Filter-into-in-memory-Filter" utility,
> and
> > cache the Filters w/ the user as Key.
>
>
> For that, one could subclass CWF and override the docIdSetToCache method
> to return an OpenBitSetDISI constructed from the given disi.
>
>
> > I will probably need to do the second part of (3) anyway, so I'm asking
> > whether such a utility is useful to exist in Lucene, and perhaps there's
> > already one (I thought I read somewhere about the ability to execute a
> Query
> > and get back a Filter, or use the results as a Filter)?
>
>
> That is what QueryWrapperFilter does.
>
>
> > I looked at
> > QueryWrapperFilter, but it doesn't seem to give me what I need, since its
> > getDocIdSet method returns an iterator which is the Scorer of the Query
> that
> > it wraps.
>
>
> The Scorer seems to be what you need, but there are cheaper disis, see
> below.
>
>
> >
> > Anyway, I think the documentation of CWF should be fixed and made
> clearer.
> >
> > Any thoughts?
>
>
> The basic problem is that disis from DocIdSets come in two variations:
> expensive
> ones e.g. based on a query, and cheap ones based e.g. on an OpenBitSet or
> on
> a SortedVIntList.
> One would normally want to cache a DocIdSet that provides a cheap disi.
>
>
> For the javadocs of the current CWF it could be sufficient to mention more
> prominently that the default CWF caches the given DocIdSet, basically
> assuming that its disi is cheap.
>
>
> But it might be a good idea to change the default implementation to check
> whether the given DocIdSet is an OpenBitSet, and use that to be cached in
> that case, and otherwise provide an OpenBitSetDISI.
>
>
> Regards,
> Paul Elschot
>
>
>

Re: Question on CachingWrapperFilter

Posted by Paul Elschot <pa...@xs4all.nl>.
On Tuesday 02 June 2009 16:39:06 Shai Erera wrote:
> Hi
> 
> I read CWF today and initially I thought this is going to cache a Filter
> in-memory for me, so that I can more efficiently use it for subsequent
> searches. But I learned that all it does is cache the DocIdSet returned by
> the wrapped Filter.
> 
> This is good in and of itself, but I wonder if we shouldn't go the extra
> mile and wrap stuff in memory for Filters which don't operate from memory.

It was good until QueryWrapperFilter returned a Scorer instead of a disi
based on an (Open)BitSet.

> For example - I have a Filter which reads information from a Payload as it's
> iterated on, so it doesn't keep anything in memory (it's per-user
> information, so I haven't decided yet if I can afford caching it in-memory
> and whether it will be beneficial). Caching that sort of Filter by CWF will
> obviously not improve anything.
> 
> I'm not sure what to do here:
> 1. Just reflect that in the javadoc (it is very confusing saying "Wraps
> another filter's result and caches it", which is not true)
> 2. Introduce a class which takes a Filter and loads it into memory (I think
> I read an issue/discussion about this), to an OpenBitSet for example (but we
> need to know the number of results in advance, or grow the array as we go
> along).
> 3. Don't use CWF, write a "load-a-Filter-into-in-memory-Filter" utility, and
> cache the Filters w/ the user as Key.

For that, one could subclass CWF and override the docIdSetToCache method
to return an OpenBitSetDISI constructed from the given disi.
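
A minimal sketch of that subclassing idea, written against stand-in types so that it compiles without Lucene on the classpath. SimpleDocIdSet, SimpleFilter, SimpleCachingWrapper and MaterializingWrapper are hypothetical names; only the hook name docIdSetToCache is taken from the text above, its real signature may differ, and java.util.BitSet stands in for OpenBitSetDISI.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.PrimitiveIterator;
import java.util.WeakHashMap;

// Stand-ins for illustration only; these are NOT the Lucene classes.
interface SimpleDocIdSet {
    PrimitiveIterator.OfInt iterator();
}

interface SimpleFilter {
    SimpleDocIdSet getDocIdSet(Object reader);
}

// Caches whatever doc ID set the wrapped filter returns, one entry per reader.
class SimpleCachingWrapper implements SimpleFilter {
    private final SimpleFilter wrapped;
    private final Map<Object, SimpleDocIdSet> cache = new WeakHashMap<>();

    SimpleCachingWrapper(SimpleFilter wrapped) {
        this.wrapped = wrapped;
    }

    // Hook: by default the set is cached as is, even if iterating it is expensive.
    protected SimpleDocIdSet docIdSetToCache(SimpleDocIdSet set) {
        return set;
    }

    public synchronized SimpleDocIdSet getDocIdSet(Object reader) {
        return cache.computeIfAbsent(reader, r -> docIdSetToCache(wrapped.getDocIdSet(r)));
    }
}

// Overrides the hook to drain the iterator once and cache an in-memory bit set.
class MaterializingWrapper extends SimpleCachingWrapper {
    MaterializingWrapper(SimpleFilter wrapped) {
        super(wrapped);
    }

    @Override
    protected SimpleDocIdSet docIdSetToCache(SimpleDocIdSet set) {
        BitSet bits = new BitSet();
        for (PrimitiveIterator.OfInt it = set.iterator(); it.hasNext();) {
            bits.set(it.nextInt());
        }
        return () -> bits.stream().iterator();
    }
}
```

With the plain wrapper, every later iteration still runs the wrapped filter's expensive iterator; with the materializing subclass, that iterator runs once per reader and later iterations read the bit set.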

> I will probably need to do the second part of (3) anyway, so I'm asking
> whether such a utility is useful to exist in Lucene, and perhaps there's
> already one (I thought I read somewhere about the ability to execute a Query
> and get back a Filter, or use the results as a Filter)?

That is what QueryWrapperFilter does.

> I looked at
> QueryWrapperFilter, but it doesn't seem to give me what I need, since its
> getDocIdSet method returns an iterator which is the Scorer of the Query that
> it wraps.

The Scorer seems to be what you need, but there are cheaper disis, see below.

> 
> Anyway, I think the documentation of CWF should be fixed and made clearer.
> 
> Any thoughts?

The basic problem is that disis from DocIdSets come in two variations: expensive
ones e.g. based on a query, and cheap ones based e.g. on an OpenBitSet or on
a SortedVIntList.
One would normally want to cache a DocIdSet that provides a cheap disi.

For the javadocs of the current CWF it could be sufficient to mention more
prominently that the default CWF caches the given DocIdSet, basically
assuming that its disi is cheap.

But it might be a good idea to change the default implementation to check
whether the given DocIdSet is an OpenBitSet, cache it directly in that
case, and otherwise build an OpenBitSetDISI from it and cache that.
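
The check described in the previous paragraph could look roughly like this. It is a sketch with stand-in types, not a patch against CWF: IdSet, InMemoryIdSet and CachePolicy are hypothetical names, java.util.BitSet plays the role of OpenBitSet, and the maxDoc parameter is an assumed sizing hint.

```java
import java.util.BitSet;
import java.util.PrimitiveIterator;

// Stand-ins for illustration only; these are NOT the Lucene classes.
interface IdSet {
    PrimitiveIterator.OfInt iterator();
}

// A set that is already in memory, so its iterator is cheap.
final class InMemoryIdSet implements IdSet {
    final BitSet bits;

    InMemoryIdSet(BitSet bits) {
        this.bits = bits;
    }

    public PrimitiveIterator.OfInt iterator() {
        return bits.stream().iterator();
    }
}

final class CachePolicy {
    // Returns the set to put in the cache: the given set if it is already
    // in memory, otherwise a bit set filled by draining its iterator once.
    static IdSet toCache(IdSet given, int maxDoc) {
        if (given instanceof InMemoryIdSet) {
            return given; // already cheap, cache as is
        }
        BitSet bits = new BitSet(maxDoc);
        for (PrimitiveIterator.OfInt it = given.iterator(); it.hasNext();) {
            bits.set(it.nextInt());
        }
        return new InMemoryIdSet(bits);
    }
}
```

A sparse result might be better served by a compact representation such as the SortedVIntList mentioned above; the sketch always picks a bit set to stay short.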

Regards,
Paul Elschot