Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2010/09/11 20:08:18 UTC

IndexReader Cache - a different angle

Hi

Lucene's caches have been heavily discussed before (e.g., LUCENE-831,
LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
many proposals to attack this problem, w/ no developed solution.

I'd like to explore a different, IMO much simpler, angle to attack this
problem. Instead of having Lucene manage the Cache itself, we let the
application manage it, while Lucene provides the necessary hooks
in IndexReader to allow it. The hooks I have in mind are:

(1) IndexReader current API for TermDocs, TermEnum, TermPositions etc. --
already exists.

(2) When reopen() is called, Lucene will take care to call a
Cache.load(IndexReader), so that the application can pull whatever
information
it needs from the passed-in IndexReader.

So to be more concrete on my proposal, I'd like to support caching in
the following way (and while I've spent some time thinking about it, I'm
sure there are great suggestions to improve it):

* The application provides a CacheFactory to IndexReader.open/reopen, which
exposes some very simple API, such as createCache or
initCache(IndexReader) etc. -- something which returns a Cache object,
which does not have a very strict/concrete API.

* IndexReader, most probably at the SegmentReader level, uses the
CacheFactory to create a new Cache instance and calls its
load(IndexReader) method, so that the Cache can initialize itself.

* The application can use CacheFactory to obtain the Cache object per
IndexReader (for example, during Collector.setNextReader), or we can
have IndexReader offer a getCache() method.

* One of the Cache APIs would be getCache(TYPE), where TYPE is a String or
Object, or an interface CacheType w/ no methods, just to be a marker
one, and the application is free to impl it however it wants. That's a
loose API, I know, but it's completely in the application's hands, which makes
Lucene's code simpler.

* We can introduce a TermsCache, TermEnumCache and TermVectorCache to
provide the user w/ an IndexReader-like API, only more efficient than,
say, TermDocs -- something w/ random access to the docs inside, perhaps
even an OpenBitSet. Lucene can take advantage of it if, say, we create a
CachingSegmentReader which makes use of the cache and checks, every time
termDocs() is called, whether the required Term is cached etc. I admit
I may be thinking too far ahead.

That's more or less what I've been thinking. I'm sure there are many
details to iron out, but I hope I've managed to pass the general
proposal through to you.
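
To make this a bit more tangible, here is a very rough sketch of what such
hooks could look like (all the names below are made up just to illustrate the
proposal -- this is not an existing Lucene API):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;

  /** Marker interface -- the application defines its own cache types. */
  interface CacheType<T> {
  }

  /** Application-defined cache; Lucene would only ever call load(). */
  interface Cache {
    /** Called for each newly (re)opened segment reader so the cache can warm itself. */
    void load(IndexReader segmentReader) throws IOException;

    /** Loose accessor -- the application decides what each TYPE maps to. */
    <T> T getCache(CacheType<T> type);
  }

  /** Passed to IndexReader.open()/reopen(); creates one Cache per segment reader. */
  interface CacheFactory {
    Cache createCache(IndexReader segmentReader) throws IOException;
  }

The point is that createCache()/load() would only be called for segment
readers that are actually new after a reopen(); everything else stays in the
application's hands.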

What I'm after first is to allow applications to deal w/ postings caching more
natively. For example, if you have a posting w/ payloads you'd like to
read into memory, or if you would like a term's TermDocs to be cached
(to be used as a Filter) etc. -- instead of writing something that can
work at the top IndexReader level, you'd be able to take advantage of
Lucene's internals, i.e. refresh the Cache only for the new segments ...
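
For instance, on 3x a Cache could pull one term's postings into a
random-access bit set per segment w/ something like this (just a sketch --
the method itself is made up, but IndexReader.termDocs(), TermDocs and
OpenBitSet are the existing 3x API):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermDocs;
  import org.apache.lucene.util.OpenBitSet;

  /** Loads the docs matching one term into an OpenBitSet, for a single segment. */
  static OpenBitSet loadTermDocs(IndexReader segmentReader, Term term) throws IOException {
    OpenBitSet bits = new OpenBitSet(segmentReader.maxDoc());
    TermDocs td = segmentReader.termDocs(term);
    try {
      while (td.next()) {
        bits.set(td.doc());
      }
    } finally {
      td.close();
    }
    return bits;
  }

A Cache.load(IndexReader) hook would only need to run something like that for
the few terms the application cares about, and only on segments it hasn't
seen before.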

I'm sure that once this is in place, we can refactor FieldCache to
work w/ that API, perhaps as a Cache specific implementation. But I
leave that for later.

I'd appreciate your comments. Before I set out to implement it, I'd like to
know if the idea has any chances of making it to a commit :).

Shai

Re: IndexReader Cache - a different angle

Posted by Shai Erera <se...@gmail.com>.
Actually, after writing the last email, I arrived at my office and was asked
a question about Filter working at the per-segment level. I completely
missed that, so indeed for the Filter approach, CachingWrapperFilter will do
the trick. Well ... it will do 'half' the trick - to warm it up, I'll need
to execute a search using that Filter, so that the new segments get cached
too, right? If I could warm just the Filters on the new segments, that'd be
best.
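
Something like the following is what I mean by warming just the new segments
(a sketch against the 3x API): getDocIdSet() on a CachingWrapperFilter is
cheap for segments that are already cached, so only the new segments pay the
cost, and no throwaway query is needed:

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.CachingWrapperFilter;

  /** Warms a cached filter after reopen() without executing a real search. */
  static void warmFilter(IndexReader topReader, CachingWrapperFilter filter) throws IOException {
    IndexReader[] subReaders = topReader.getSequentialSubReaders();
    if (subReaders == null) {
      // the reader is atomic, e.g. a single SegmentReader
      subReaders = new IndexReader[] { topReader };
    }
    for (IndexReader sub : subReaders) {
      // builds and caches the DocIdSet only for segments not seen before
      filter.getDocIdSet(sub);
    }
  }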

I'm getting more and more convinced that for 4.0, Codec seems to be the
right place for postings caching (term docs, positions, payloads ...), but
what about 3x? Perhaps this should be a 4.0-only feature ... if only 4.0
comes out sometime in the near future (like in 2010).

What I had in mind for 3x is that we can create a CachingSR which will use
the internal Cache for postings when they are requested. But perhaps we
should not work too hard to enable such a thing, especially if 4.0 has a
nice way of supporting it.

I'll try to work with what I have in 3x for now, and take a look at Codecs,
to perhaps prepare better for when it comes out. Thanks for your comments!

Shai

On Sun, Sep 12, 2010 at 11:46 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Having hooks to enable an app to manage its own "external, private
> stuff associated w/ each segment reader" would be useful and it's been
> asked for in the past.  However, since we've now opened up
> SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app
> already do this w/o core API changes?
>
> I know Earwin has built a whole system like this on top of Lucene --
> Earwin how did you do that...?  Did you make core changes to
> Lucene...?
>
> A custom Codec should be an excellent way to handle the specific use
> case (caching certain postings) -- by doing it as a Codec, any time
> anything in Lucene needs to tap into that posting (query scorers,
> filters, merging, applying deletes, etc), it hits this cache.  You
> could model it like PulsingCodec, which wraps any other Codec but
> handles the low-freq ones itself.  If you do it externally how would
> core use of postings hit it?  (Or was that not the intention?)
>
> I don't understand the filter use-case... the CachingWrapperFilter
> already caches per-segment, so that reopen is efficient?  How would an
> external cache (built on these hooks) be different?
>
> For faster filters we have to apply them like we do deleted docs if
> the filter is "random access" (such as being cached), LUCENE-1536 --
> flex actually makes this relatively easy now, since the postings API
> no longer implicitly filters deleted docs (ie you provide your own
> skipDocs) -- but these hooks won't fix that right?
>
> Mike
>
> On Sun, Sep 12, 2010 at 3:43 AM, Simon Willnauer
> <si...@googlemail.com> wrote:
> > Hey Shai,
> >
> > On Sun, Sep 12, 2010 at 6:51 AM, Shai Erera <se...@gmail.com> wrote:
> >> Hey Simon,
> >>
> >> You're right that the application can develop a Caching mechanism outside
> >> Lucene, and when reopen() is called, if it changed, iterate on the
> >> sub-readers and init the Cache w/ the new ones.
> >
> > Alright, then we are on the same track I guess!
> >
> >>
> >> However, by building something like that inside Lucene, the application will
> >> get more native support, and thus better performance, in some cases. For
> >> example, consider a field fileType with 10 possible values, and for the sake
> >> of simplicity, let's say that the index is divided evenly across them. Your
> >> users always add such a term constraint to the query (e.g. they want to get
> >> results of fileType:pdf or fileType:odt, and perhaps sometimes both, but not
> >> others). You have basically two ways of supporting this:
> >> (1) Add such a term to the query / clause to a BooleanQuery w/ an AND
> >> relation -- cons is that this term / posting is read for every query.
> >
> > Oh I wasn't saying that a cache framework would be obsolete and
> > shouldn't be part of lucene. My intention was rather to generalize
> > this functionality so that we can make the API change more easily and
> > at the same time bring the infrastructure you are proposing into
> > place.
> >
> > Regarding your example above, filters are a very good example where
> > something like that could help to improve performance and we should
> > provide it with lucene core, but I would again prefer the least
> > intrusive way to do so. If we can make that happen without adding any
> > cache-specific API we should do it. We really should try to sketch out
> > a simple API which gives us access to the opened segReaders and see if
> > that would be sufficient for our use cases. Specialization will always
> > be possible but I doubt that it is needed.
> >>
> >> (2) Write a Filter which works at the top IR level, that is refreshed
> >> whenever the index is refreshed. This is better than (1), however it has some
> >> disadvantages:
> >>
> >> (2.1) As Mike already proved (on some issue which I don't remember its
> >> subject/number at the moment), if we could get Filter down to the lower
> >> level components of Lucene's search, so e.g. it is used as the deleted docs
> >> OBS, we can get better performance w/ Filters.
> >>
> >> (2.2) The Filter is refreshed for the entire IR, and not just the changed
> >> segments. Reason is, outside Collector, you have no way of telling
> >> IndexSearcher "use Filter F1 for segment S1 and F2 for segment S2".
> >> Loading/refreshing the Filter may be expensive, and definitely won't perform
> >> well w/ NRT, where by definition you'd like to get small changes searchable
> >> very fast.
> >
> > No doubt you are right about the above. A
> > PerSegmentCachingFilterWrapper would be something we could easily do at
> > the application level with the infrastructure we are talking
> > about in place. I don't exactly know how I feel -- maybe this
> > particular problem should rather be addressed internally, and I'm not
> > sure the high level Cache mechanism is the right way to do it, but
> > this is just a gut feeling. When I think about it twice, it might
> > well be sufficient to do it that way....
> >>
> >> Therefore I think that if we could provide the necessary hooks in Lucene,
> >> let's call it a Cache plug-in for now, we can incrementally improve the
> >> search process. I don't want to go too far into the design of a generic
> >> plug-ins mechanism, but you're right (again :)) -- we could offer a
> >> reopen(PluginProvider) which is entirely not about Cache, and Cache would
> >> become one of the Plugins the PluginProvider provides. I just try to learn
> >> from past experience -- when the discussion is focused, there's a better
> >> chance of getting to a resolution. However, if you think that in this case a
> >> more generic API, such as PluginProvider, would get us to a resolution faster, I
> >> don't mind spending some time to think about it. But for all practical
> >> purposes, we should IMO start w/ a Cache plug-in that is called just that,
> >> and if it catches on, generify it ...
> > I absolutely agree the API might be more generic, but our current
> > use-case / PoC should be caching. I don't like the name Plugin, but
> > that's a personal thing, since you are not plugging anything in.
> > Something like SubreaderCallback or ReaderVisitor might be more
> > accurate, but let's argue about the details later. Why not sketch
> > something out for the filter problem and follow on from there? The
> > more iterations the better. And back to your question whether this
> > would be something committable: if it works stand-alone / is not too
> > tightly coupled, I would absolutely say yes.
> >>
> >> Unfortunately, I haven't had enough experience w/ Codecs yet (still on 3x)
> >> so I can't comment on how feasible that solution is. I'll take your word for
> >> it that it's doable :). But this doesn't give us a 3x solution ... the
> >> Caching framework on trunk can be developed w/ Codecs.
> >
> > I guess nobody really has, except for Mike and maybe one or two others,
> > but from what I have done so far regarding codecs I would say that is the
> > place to solve this particular problem. Maybe even lower than that, on
> > the Directory level. Anyhow, let's focus on application level caches for
> > now. We are not aiming to provide a whole full-fledged Cache API, but
> > the infrastructure to make it easier to build those on an app basis,
> > which would be a valuable improvement. We should also look at Solr's
> > cache implementations and how they could benefit from this effort;
> > since Solr uses app-level caching, we can learn from it API-design-wise.
> >
> > simon
> >>
> >> Shai
> >>
> >> On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer
> >> <si...@googlemail.com> wrote:
> >>>
> >>> Hi Shai,
> >>>
> >>> On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera <se...@gmail.com> wrote:
> >>> > Hi
> >>> >
> >>> > Lucene's caches have been heavily discussed before (e.g., LUCENE-831,
> >>> > LUCENE-2133 and LUCENE-2394) and from what I can tell, there have
> been
> >>> > many proposals to attack this problem, w/ no developed solution.
> >>>
> >>> I didn't go through those issues so forgive me if something I bring up
> >>> has already been discussed.
> >>> I have a couple of questions about your proposal - please find them
> >>> inline...
> >>>
> >>> >
> >>> > I'd like to explore a different, IMO much simpler, angle to attack
> this
> >>> > problem. Instead of having Lucene manage the Cache itself, we let the
> >>> > application manage it, however Lucene will provide the necessary
> hooks
> >>> > in IndexReader to allow it. The hooks I have in mind are:
> >>> >
> >>> > (1) IndexReader current API for TermDocs, TermEnum, TermPositions
> etc.
> >>> > --
> >>> > already exists.
> >>> >
> >>> > (2) When reopen() is called, Lucene will take care to call a
> >>> > Cache.load(IndexReader), so that the application can pull whatever
> >>> > information
> >>> > it needs from the passed-in IndexReader.
> >>> Would that do anything else than pass the new reader (if reopened)
> >>> to the cache's load method? I wonder if this is more than
> >>> if (newReader != oldReader)
> >>>   Cache.load(newReader)
> >>>
> >>> If so, something like that should be done on a segment reader anyway,
> >>> right? From my perspective this isn't more than a callback or visitor
> >>> that should walk through the subreaders and be called for each reopened
> >>> sub-reader. A cache-warming visitor / callback would then be trivial
> >>> and the API would be more general.
> >>>
> >>>
> >>> > So to be more concrete on my proposal, I'd like to support caching in
> >>> > the following way (and while I've spent some time thinking about it,
> I'm
> >>> > sure there are great suggestions to improve it):
> >>> >
> >>> > * Application provides a CacheFactory to IndexReader.open/reopen,
> which
> >>> > exposes some very simple API, such as createCache, or
> >>> > initCache(IndexReader) etc. Something which returns a Cache object,
> >>> > which does not have very strict/concrete API.
> >>>
> >>> My first question would be why the reader should know about Cache if
> >>> there is no strict / concrete API?
> >>> I can follow you with the CacheFactory to create cache objects but why
> >>> would the reader have to know / "receive" this object? Maybe this is
> >>> answered further down the path but I don't see the reason why the
> >>> notion of a "cache" must exist within open/reopen or if that could be
> >>> implemented in a more general, cache-agnostic way.
> >>> >
> >>> > * IndexReader, most probably at the SegmentReader level uses
> >>> > CacheFactory to create a new Cache instance and calls its
> >>> > load(IndexReader) method, so that the Cache would initialize itself.
> >>> That is what I was thinking above - yet is that more than a callback
> >>> for each reopened or opened segment reader?
> >>>
> >>> >
> >>> > * The application can use CacheFactory to obtain the Cache object per
> >>> > IndexReader (for example, during Collector.setNextReader), or we can
> >>> > have IndexReader offer a getCache() method.
> >>> :)  Until here the cache is only used by the application itself, not by
> >>> any Lucene API, right? I can think of a lot of application-specific data
> >>> that could usefully be associated with an IR beyond the caching
> >>> use case - again this could be a more general API solving that
> >>> problem.
> >>> >
> >>> > * One of Cache API would be getCache(TYPE), where TYPE is a String or
> >>> > Object, or an interface CacheType w/ no methods, just to be a marker
> >>> > one, and the application is free to impl it however it wants. That's
> a
> >>> > loose API, I know, but completely at the application hands, which
> makes
> >>> > Lucene code simpler.
> >>> I like the idea, together with the metadata-associating functionality
> >>> from above - something like public T IndexReader#get(Type<T> type).
> >>> Hmm, that looks quite similar to Attributes, doesn't it?! :) However this
> >>> could be done in many ways, but again cache-agnostic.
> >>> >
> >>> > * We can introduce a TermsCache, TermEnumCache and TermVectorCache to
> >>> > provide the user w/ IndexReader-similar API, only more efficient than
> >>> > say TermDocs -- something w/ random access to the docs inside,
> perhaps
> >>> > even an OpenBitSet. Lucene can take advantage of it if, say, we
> create a
> >>> > CachingSegmentReader which makes use of the cache, and checks every
> time
> >>> > termDocs() is called if the required Term is cached or not etc. I
> admit
> >>> > I may be thinking too much ahead.
> >>> I see what you are trying to do here. I also see how this could be
> >>> useful, but I guess coming up with a stable API which serves lots of
> >>> applications would be quite hard. A CachingSegmentReader could be a
> >>> very simple decorator which would not require touching the IR
> >>> interface. Something like that could be part of lucene, but I'm not
> >>> sure it necessarily belongs in lucene core.
> >>>
> >>> > That's more or less what I've been thinking. I'm sure there are many
> >>> > details to iron out, but I hope I've managed to pass the general
> >>> > proposal through to you.
> >>>
> >>> Absolutely, this is how it works isn't it!
> >>>
> >>> >
> >>> > What I'm after first, is to allow applications deal w/ postings
> caching
> >>> > more
> >>> > natively. For example, if you have a posting w/ payloads you'd like
> to
> >>> > read into memory, or if you would like a term's TermDocs to be cached
> >>> > (to be used as a Filter) etc. -- instead of writing something that
> can
> >>> > work at the top IndexReader level, you'd be able to take advantage of
> >>> > Lucene internals, i.e. refresh the Cache only for the new segments
> ...
> >>>
> >>> I wonder if a custom codec would be the right place to implement
> >>> caching / mem resident structures for Postings with payloads etc. You
> >>> could do that on a higher level too but codec seems to be the way to
> >>> go here, right?
> >>> To utilize per-segment capabilities, a callback for (re)opened segment
> >>> readers would be sufficient - or do I miss something?
> >>>
> >>> simon
> >>> >
> >>> > I'm sure that after this will be in place, we can refactor FieldCache
> to
> >>> > work w/ that API, perhaps as a Cache specific implementation. But I
> >>> > leave that for later.
> >>> >
> >>> > I'd appreciate your comments. Before I set to implement it, I'd like
> to
> >>> > know if the idea has any chances of making it to a commit :).
> >>> >
> >>> > Shai

Re: IndexReader Cache - a different angle

Posted by Lance Norskog <go...@gmail.com>.
Could there be another implementation of sorting? With very large
indexes and small total result spaces, it would make sense to
maintain a partial list of sorted ids per field. Every search that
finds new ids adds them to the master list. There could even be a
cache eviction policy.

Lance

On Mon, Sep 13, 2010 at 8:01 AM, Danil ŢORIN <to...@gmail.com> wrote:
> And it would be nice to have hooks in lucene and avoid managing refs
> to indexReader on reopen() and close() by myself.
>
> Oh...and to complicate things, my index is near-real-time using
> IndexWriter.getReader(), so it's not just IndexReader we need to
> change, but also IndexWriter should provide a reader that has proper
> FieldCache implementation.
>
> And I'm a bit uncomfortable to dig that deep :)
>
> On Mon, Sep 13, 2010 at 17:51, Danil ŢORIN <to...@gmail.com> wrote:
>> I'd second that....
>>
>> In my use case we need to search, sometimes with sort, on a pretty big index...
>>
>> So in the worst case scenario we get an OOM while loading the FieldCache, as it
>> tries to create a huge array.
>> You can increase -Xmx, go to a bigger host, but in the end there WILL be
>> an index big enough to crash you.
>>
>> My idea would be to use something like EhCache with a few elements in
>> memory and overflow to disk, so that if there are only a few unique terms, it
>> would be almost as fast as an array.
>> Otherwise, in Collector/Sort/SortField/FieldComparator I would hit the
>> EhCache on disk (yes, it would be a huge performance hit) but I won't
>> get OOMs and the results will STILL be sorted.
>>
>> Right now SegmentReader/FieldCacheImpl are pretty hardcoded on
>> FieldCache.DEFAULT
>>
>> And yes, I'm on 3.x...
>>
>>
>> On Mon, Sep 13, 2010 at 16:05, Tim Smith <ts...@attivio.com> wrote:
>>  I created https://issues.apache.org/jira/browse/LUCENE-2345 some time ago,
>> proposing pretty much what seems to be discussed here.
>>>
>>>
>>>  -- Tim
>>>
>>> On 09/12/10 10:18, Simon Willnauer wrote:
>>>>
>>>> On Sun, Sep 12, 2010 at 11:46 AM, Michael McCandless
>>>> <lu...@mikemccandless.com>  wrote:
>>>>>
>>>>> Having hooks to enable an app to manage its own "external, private
>>>>> stuff associated w/ each segment reader" would be useful and it's been
>>>>> asked for in the past.  However, since we've now opened up
>>>>> SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app
>>>>> already do this w/o core API changes?
>>>>
>>>> The visitor approach would simply be a little more than syntactic
>>>> sugar where only new SubReader instances are passed to the callback.
>>>> You can do the same with the already existing API like
>>>> gatherSubReaders or getSequentialSubReaders. Every API I was talking
>>>> about would just be simplification anyway and would be possible to
>>>> build without changing the core.
>>>>>
>>>>> I don't understand the filter use-case... the CachingWrapperFilter
>>>>> already caches per-segment, so that reopen is efficient?  How would an
>>>>> external cache (built on these hooks) be different?
>>>>
>>>> Man you are right - never mind :)
>>>>
>>>> simon



-- 
Lance Norskog
goksron@gmail.com



Re: IndexReader Cache - a different angle

Posted by Danil ŢORIN <to...@gmail.com>.
And it would be nice to have hooks in lucene and avoid managing refs
to indexReader on reopen() and close() by myself.

Oh...and to complicate things, my index is near-real-time using
IndexWriter.getReader(), so it's not just IndexReader we need to
change, but also IndexWriter should provide a reader that has proper
FieldCache implementation.

And I'm a bit uncomfortable to dig that deep :)

On Mon, Sep 13, 2010 at 17:51, Danil ŢORIN <to...@gmail.com> wrote:
> I'd second that....
>
> In my usecase we need to search, sometimes with sort, on pretty big index...
>
> So in worst case scenario we get OOM while loading FieldCache as it
> tries do create an huge array.
> You can increase -Xmx, go to bigger host, but in the end there WILL be
> an index big enough to crash you.
>
> My idea would be to use something like EhCache with few elements in
> memory and overflow to disk, so that if there are few unique terms, it
> would be almost as fast as an array.
> Otherwise in Collector/Sort/SortField/FieldComparator I would hit the
> EhCache on disk (yes it would be a huge performance hit) but I won't
> get OOMs and the results STILL will be sorted.
>
> Right now SegmentReader/FieldCacheImpl are pretty hardcoded on
> FieldCache.DEFAULT
>
> And yes, I'm on 3.x...
>
>
> On Mon, Sep 13, 2010 at 16:05, Tim Smith <ts...@attivio.com> wrote:
>>  i created https://issues.apache.org/jira/browse/LUCENE-2345 some time ago
>> proposing pretty much what seems to be discussed here
>>
>>
>>  -- Tim
>>
>> On 09/12/10 10:18, Simon Willnauer wrote:
>>>
>>> On Sun, Sep 12, 2010 at 11:46 AM, Michael McCandless
>>> <lu...@mikemccandless.com>  wrote:
>>>>
>>>> Having hooks to enable an app to manage its own "external, private
>>>> stuff associated w/ each segment reader" would be useful and it's been
>>>> asked for in the past.  However, since we've now opened up
>>>> SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app
>>>> already do this w/o core API changes?
>>>
>>> The visitor approach would simply be a little more than syntactic
>>> sugar where only new SubReader instances are passed to the callback.
>>> You can do the same with the already existing API like
>>> gatherSubReaders or getSequentialSubReaders. Every API I was talking
>>> about would just be simplification anyway and would be possible to
>>> build without changing the core.
>>>>
>>>> I know Earwin has built a whole system like this on top of Lucene --
>>>> Earwin how did you do that...?  Did you make core changes to
>>>> Lucene...?
>>>>
>>>> A custom Codec should be an excellent way to handle the specific use
>>>> cache (caching certain postings) -- by doing it as a Codec, any time
>>>> anything in Lucene needs to tap into that posting (query scorers,
>>>> filters, merging, applying deletes, etc), it hits this cache.  You
>>>> could model it like PulsingCodec, which wraps any other Codec but
>>>> handles the low-freq ones itself.  If you do it externally how would
>>>> core use of postings hit it?  (Or was that not the intention?)
>>>>
>>>> I don't understand the filter use-case... the CachingWrapperFilter
>>>> already caches per-segment, so that reopen is efficient?  How would an
>>>> external cache (built on these hooks) be different?
>>>
>>> Man you are right - never mind :)
>>>
>>> simon
>>>>
>>>> For faster filters we have to apply them like we do deleted docs if
>>>> the filter is "random access" (such as being cached), LUCENE-1536 --
>>>> flex actually makes this relatively easy now, since the postings API
>>>> no longer implicitly filters deleted docs (ie you provide your own
>>>> skipDocs) -- but these hooks won't fix that right?
>>>>
>>>> Mike
>>>>
>>>> On Sun, Sep 12, 2010 at 3:43 AM, Simon Willnauer
>>>> <si...@googlemail.com>  wrote:
>>>>>
>>>>> Hey Shai,
>>>>>
>>>>> On Sun, Sep 12, 2010 at 6:51 AM, Shai Erera<se...@gmail.com>  wrote:
>>>>>>
>>>>>> Hey Simon,
>>>>>>
>>>>>> You're right that the application can develop a Caching mechanism
>>>>>> outside
>>>>>> Lucene, and when reopen() is called, if it changed, iterate on the
>>>>>> sub-readers and init the Cache w/ the new ones.
>>>>>
>>>>> Alright, then we are on the same track I guess!
>>>>>
>>>>>> However, by building something like that inside Lucene, the application
>>>>>> will
>>>>>> get more native support, and thus better performance, in some cases.
>>>>>> For
>>>>>> example, consider a field fileType with 10 possible values, and for the
>>>>>> sake
>>>>>> of simplicity, let's say that the index is divided evenly across them.
>>>>>> Your
>>>>>> users always add such a term constraint to the query (e.g. they want to
>>>>>> get
>>>>>> results of fileType:pdf or fileType:odt, and perhaps sometimes both,
>>>>>> but not
>>>>>> others). You have basically two ways of supporting this:
>>>>>> (1) Add such a term to the query / clause to a BooleanQuery w/ an AND
>>>>>> relation -- cons is that this term / posting is read for every query.
>>>>>
>>>>> Oh I wasn't saying that a cache framework would be obsolet and
>>>>> shouldn't be part of lucene. My intention was rather to generalize
>>>>> this functionality so that we can make the API change more easily and
>>>>> at the same time brining the infrastructure you are proposing in
>>>>> place.
>>>>>
>>>>> Regarding you example above, filters are a very good example where
>>>>> something like that could help to improve performance and we should
>>>>> provide it with lucene core but I would again prefer the least
>>>>> intrusive way to do so. If we can make that happen without adding any
>>>>> cache agnostic API we should do it. We really should try to sketch out
>>>>> a simple API with gives us access to the opened segReaders and see if
>>>>> that would be sufficient for our usecases. Specialization will always
>>>>> be possible but I doubt that it is needed.
>>>>>>
>>>>>> (2) Write a Filter which works at the top IR level, that is refreshed
>>>>>> whenever the index is refreshed. This is better than (1), however has
>>>>>> some
>>>>>> disadvantages:
>>>>>>
>>>>>> (2.1) As Mike already proved (on some issue which I don't remember its
>>>>>> subject/number at the moment), if we could get Filter down to the lower
>>>>>> level components of Lucene's search, so e.g. it is used as the deleted
>>>>>> docs
>>>>>> OBS, we can get better performance w/ Filters.
>>>>>>
>>>>>> (2.2) The Filter is refreshed for the entire IR, and not just the
>>>>>> changed
>>>>>> segments. Reason is, outside Collector, you have no way of telling
>>>>>> IndexSearcher "use Filter F1 for segment S1 and F2 for segment F2".
>>>>>> Loading/refreshing the Filter may be expensive, and definitely won't
>>>>>> perform
>>>>>> well w/ NRT, where by definition you'd like to get small changes
>>>>>> searchable
>>>>>> very fast.
>>>>>
>>>>> No doubt you are right about the above. A
>>>>> PerSegmentCachingFilterWrapper would be something we can easily do on
>>>>> an application level basis with the infrastructure we are talking
>>>>> about in place. While I don't exactly know how I feel that this
>>>>> particular problem should rather be addressed internally and I'm not
>>>>> sure if the high level Cache mechanism is the right way to do it but
>>>>> this is just a gut feeling. But when I think about it twice it might
>>>>> be way sufficient enough to do it....
>>>>>>
>>>>>> Therefore I think that if we could provide the necessary hooks in
>>>>>> Lucene,
>>>>>> let's call it a Cache plug-in for now, we can incrementally improve the
>>>>>> search process. I don't want to go too far into the design of a generic
>>>>>> plug-ins mechanism, but you're right (again :)) -- we could offer a
>>>>>> reopen(PluginProvider) which is entirely not about Cache, and Cache
>>>>>> would
>>>>>> become one of the Plugins the PluginProvider provides. I just try to
>>>>>> learn
>>>>>> from past experience -- when the discussion is focused, there's a
>>>>>> better
>>>>>> chance of getting to a resolution. However if you think that in this
>>>>>> case, a
>>>>>> more generic API, as PluginProvider, would get us to a resolution
>>>>>> faster, I
>>>>>> don't mind spend some time to think about it. But for all practical
>>>>>> purposes, we should IMO start w/ a Cache plug-in, that is called like
>>>>>> that,
>>>>>> and if it catches, generify it ...
>>>>>
>>>>> I absolutely agree the API might be more generic but our current
>>>>> use-case / PoC should be a caching. I don't like the name Plugin but
>>>>> thats a personal thing since you are not pluggin anything is.
>>>>> Something like SubreaderCallback or ReaderVisitor might be more
>>>>> accurate but lets argue about the details later. Why not sketching
>>>>> something out for the filter problem and follow on from there? The
>>>>> more iteration the better and back to your question if that would be
>>>>> something which could make it to be committable I would say if it
>>>>> works stand alone / not to tightly coupled I would absolutely say yes.
>>>>>>
>>>>>> Unfortunately, I haven't had enough experience w/ Codecs yet (still on
>>>>>> 3x)
>>>>>> so I can't comment on how feasible that solution is. I'll take your
>>>>>> word for
>>>>>> it that it's doable :). But this doesn't give us a 3x solution ... the
>>>>>> Caching framework on trunk can be developed w/ Codecs.
>>>>>
>>>>> I guess nobody really has except of mike and maybe one or two others
>>>>> but what I have done so far regarding codecs I would say that is the
>>>>> place to solve this particular problem. Maybe even lower than that on
>>>>> a Directory level. Anyhow, lets focus on application level caches for
>>>>> now. We are not aiming to provide a whole full fledged Cache API but
>>>>> the infrastructure to make it easier to build those on a app basis
>>>>> which would be a valuable improvement. We should also look at Solr's
>>>>> cache implementations and how they could benefit from this efforts
>>>>> since Solr uses app-level caching we can learn from API design wise.
>>>>>
>>>>> simon
>>>>>>
>>>>>> Shai
>>>>>>
>>>>>> On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer
>>>>>> <si...@googlemail.com>  wrote:
>>>>>>>
>>>>>>> Hi Shai,
>>>>>>>
>>>>>>> On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera<se...@gmail.com>  wrote:
>>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> Lucene's Caches have been heavilydiscussed before (e.g., LUCENE-831,
>>>>>>>> LUCENE-2133 and LUCENE-2394) and from what I can tell, there have
>>>>>>>> been
>>>>>>>> many proposals to attack this problem, w/ no developed solution.
>>>>>>>
>>>>>>> I didn't go through those issues so forgive me if something I bring up
>>>>>>> has already been discussed.
>>>>>>> I have a couple of question about your proposal - please find them
>>>>>>> inline...
>>>>>>>
>>>>>>>> I'd like to explore a different, IMO much simpler, angle to attach
>>>>>>>> this
>>>>>>>> problem. Instead of having Lucene manage the Cache itself, we let the
>>>>>>>> application manage it, however Lucene will provide the necessary
>>>>>>>> hooks
>>>>>>>> in IndexReader to allow it. The hooks I have in mind are:
>>>>>>>>
>>>>>>>> (1) IndexReader current API for TermDocs, TermEnum, TermPositions
>>>>>>>> etc.
>>>>>>>> --
>>>>>>>> already exists.
>>>>>>>>
>>>>>>>> (2) When reopen() is called, Lucene will take care to call a
>>>>>>>> Cache.load(IndexReader), so that the application can pull whatever
>>>>>>>> information
>>>>>>>> it needs from the passed-in IndexReader.
>>>>>>>
>>>>>>> Would that do anything else than passing the new reader (if reopened)
>>>>>>> to the caches load method? I wonder if this is more than
>>>>>>> If(newReader != oldReader)
>>>>>>>  Cache.load(newReader)
>>>>>>>
>>>>>>> If so something like that should be done on a segment reader anyway,
>>>>>>> right? From my perspective this isn't more than a callback or visitor
>>>>>>> that should walk though the subreaders and called for each reopened
>>>>>>> sub-reader. A cache-warming visitor / callback would then be trivial
>>>>>>> and the API would be more general.
>>>>>>>
>>>>>>>
>>>>>>>> So to be more concrete on my proposal, I'd like to support caching in
>>>>>>>> the following way (and while I've spent some time thinking about it,
>>>>>>>> I'm
>>>>>>>> sure there are great suggestions to improve it):
>>>>>>>>
>>>>>>>> * Application provides a CacheFactory to IndexReader.open/reopen,
>>>>>>>> which
>>>>>>>> exposes some very simple API, such as createCache, or
>>>>>>>> initCache(IndexReader) etc. Something which returns a Cache object,
>>>>>>>> which does not have very strict/concrete API.
>>>>>>>
>>>>>>> My first question would be why the reader should know about Cache if
>>>>>>> there is no strict / concrete API?
>>>>>>> I can follow you with the CacheFactory to create cache objects but why
>>>>>>> would the reader have to know / "receive" this object? Maybe this is
>>>>>>> answered further down the path but I don't see the reason why the
>>>>>>> notion of a "cache" must exist within open/reopen or if that could be
>>>>>>> implemented in a more general "cache" - agnostic way.
>>>>>>>>
>>>>>>>> * IndexReader, most probably at the SegmentReader level uses
>>>>>>>> CacheFactory to create a new Cache instance and calls its
>>>>>>>> load(IndexReader) method, so that the Cache would initialize itself.
>>>>>>>
>>>>>>> That is what I was thinking above - yet is that more than a callback
>>>>>>> for each reopened or opened segment reader?
>>>>>>>
>>>>>>>> * The application can use CacheFactory to obtain the Cache object per
>>>>>>>> IndexReader (for example, during Collector.setNextReader), or we can
>>>>>>>> have IndexReader offer a getCache() method.
>>>>>>>
>>>>>>> :)  until here the cache is only used by the application itself not by
>>>>>>> any Lucene API, right? I can think of many application specific data
>>>>>>> that could be useful to be associated with an IR beyond the cacheing
>>>>>>> use case - again this could be a more general API solving that
>>>>>>> problem.
>>>>>>>>
>>>>>>>> * One of Cache API would be getCache(TYPE), where TYPE is a String or
>>>>>>>> Object, or an interface CacheType w/ no methods, just to be a marker
>>>>>>>> one, and the application is free to impl it however it wants. That's
>>>>>>>> a
>>>>>>>> loose API, I know, but completely at the application hands, which
>>>>>>>> makes
>>>>>>>> Lucene code simpler.
>>>>>>>
>>>>>>> I like the idea together with the metadata associating functionality
>>>>>>> from above something like public T IndexReader#get(Type<T>  type).
>>>>>>> Hmm that looks quiet similar to Attributes, does it?! :) However this
>>>>>>> could be done in many ways but again "cache" - agnositc
>>>>>>>>
>>>>>>>> * We can introduce a TermsCache, TermEnumCache and TermVectorCache to
>>>>>>>> provide the user w/ IndexReader-similar API, only more efficient than
>>>>>>>> say TermDocs -- something w/ random access to the docs inside,
>>>>>>>> perhaps
>>>>>>>> even an OpenBitSet. Lucene can take advantage of it if, say, we
>>>>>>>> create a
>>>>>>>> CachingSegmentReader which makes use of the cache, and checks every
>>>>>>>> time
>>>>>>>> termDocs() is called if the required Term is cached or not etc. I
>>>>>>>> admit
>>>>>>>> I may be thinking too much ahead.
>>>>>>>
>>>>>>> I see what you are trying to do here. I also see how this could be
>>>>>>> useful but I guess coming up with a stable APi which serves lots of
>>>>>>> applications would be quiet hard. A CachingSegmentReader could be a
>>>>>>> very simple decorator which would not require to touch the IR
>>>>>>> interface. Something like that could be part of lucene but I'm not
>>>>>>> sure if necessarily part of lucene core.
>>>>>>>
>>>>>>>> That's more or less what I've been thinking. I'm sure there are many
>>>>>>>> details to iron out, but I hope I've managed to pass the general
>>>>>>>> proposal through to you.
>>>>>>>
>>>>>>> Absolutely, this is how it works isn't it!
>>>>>>>
>>>>>>>> What I'm after first, is to allow applications deal w/ postings
>>>>>>>> caching
>>>>>>>> more
>>>>>>>> natively. For example, if you have a posting w/ payloads you'd like
>>>>>>>> to
>>>>>>>> read into memory, or if you would like a term's TermDocs to be cached
>>>>>>>> (to be used as a Filter) etc. -- instead of writing something that
>>>>>>>> can
>>>>>>>> work at the top IndexReader level, you'd be able to take advantage of
>>>>>>>> Lucene internals, i.e. refresh the Cache only for the new segments
>>>>>>>> ...
>>>>>>>
>>>>>>> I wonder if a custom codec would be the right place to implement
>>>>>>> caching / mem resident structures for Postings with payloads etc. You
>>>>>>> could do that on a higher level too but codec seems to be the way to
>>>>>>> go here, right?
>>>>>>> To utilize per segment capabilities a callback for (re)opened segment
>>>>>>> readers would be sufficient or do I miss something?
>>>>>>>
>>>>>>> simon
>>>>>>>>
>>>>>>>> I'm sure that after this will be in place, we can refactor FieldCache
>>>>>>>> to
>>>>>>>> work w/ that API, perhaps as a Cache specific implementation. But I
>>>>>>>> leave that for later.
>>>>>>>>
>>>>>>>> I'd appreciate your comments. Before I set to implement it, I'd like
>>>>>>>> to
>>>>>>>> know if the idea has any chances of making it to a commit :).
>>>>>>>>
>>>>>>>> Shai
>>>>>>>>
>>>>>>>>



Re: IndexReader Cache - a different angle

Posted by Danil ŢORIN <to...@gmail.com>.
I'd second that....

In my use case we need to search, sometimes with sort, on a pretty big index...

So in the worst-case scenario we get an OOM while loading the FieldCache, as it
tries to create a huge array.
You can increase -Xmx or move to a bigger host, but in the end there WILL be
an index big enough to crash you.

My idea would be to use something like EhCache with a few elements in
memory and overflow to disk, so that if there are only a few unique terms it
would be almost as fast as an array.
Otherwise, in Collector/Sort/SortField/FieldComparator I would hit the
EhCache on disk (yes, it would be a huge performance hit), but I won't
get OOMs and the results will STILL be sorted.

Right now SegmentReader/FieldCacheImpl are pretty much hardcoded to
FieldCache.DEFAULT.

And yes, I'm on 3.x...
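
To make the idea a bit more concrete, here is a rough 3.x-style sketch
(illustration only; DocValueStore is an invented interface that something like
EhCache could sit behind, it is not an existing Lucene or EhCache API) of
loading a single-valued string field with a bounded in-memory part and a disk
overflow:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// Invented abstraction; an implementation could delegate to EhCache's disk store.
interface DocValueStore {
  void put(int doc, String value);
  String get(int doc);
}

class OverflowingStringValues {
  private final Map<Integer, String> hot = new HashMap<Integer, String>(); // bounded in-memory part
  private final DocValueStore overflow;                                    // disk-backed part

  OverflowingStringValues(IndexReader reader, String field,
                          DocValueStore overflow, int maxInMemory) throws IOException {
    this.overflow = overflow;
    // Walk the terms of a single-valued field and record doc -> value,
    // spilling to the overflow store once the in-memory map is full.
    TermEnum te = reader.terms(new Term(field, ""));
    TermDocs td = reader.termDocs();
    try {
      do {
        Term t = te.term();
        if (t == null || !t.field().equals(field)) break;
        td.seek(te);
        while (td.next()) {
          if (hot.size() < maxInMemory) hot.put(td.doc(), t.text());
          else overflow.put(td.doc(), t.text());
        }
      } while (te.next());
    } finally {
      te.close();
      td.close();
    }
  }

  // A custom FieldComparator could call this from copy()/compareBottom():
  // fast for the in-memory part, slower (disk) for the rest, but no OOM.
  String get(int doc) {
    String v = hot.get(doc);
    return v != null ? v : overflow.get(doc);
  }
}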


On Mon, Sep 13, 2010 at 16:05, Tim Smith <ts...@attivio.com> wrote:
>  i created https://issues.apache.org/jira/browse/LUCENE-2345 some time ago
> proposing pretty much what seems to be discussed here
>
>
>  -- Tim
>
> On 09/12/10 10:18, Simon Willnauer wrote:
>>
>> On Sun, Sep 12, 2010 at 11:46 AM, Michael McCandless
>> <lu...@mikemccandless.com>  wrote:
>>>
>>> Having hooks to enable an app to manage its own "external, private
>>> stuff associated w/ each segment reader" would be useful and it's been
>>> asked for in the past.  However, since we've now opened up
>>> SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app
>>> already do this w/o core API changes?
>>
>> The visitor approach would simply be a little more than syntactic
>> sugar where only new SubReader instances are passed to the callback.
>> You can do the same with the already existing API like
>> gatherSubReaders or getSequentialSubReaders. Every API I was talking
>> about would just be simplification anyway and would be possible to
>> build without changing the core.
>>>
>>> I know Earwin has built a whole system like this on top of Lucene --
>>> Earwin how did you do that...?  Did you make core changes to
>>> Lucene...?
>>>
>>> A custom Codec should be an excellent way to handle the specific use
>>> cache (caching certain postings) -- by doing it as a Codec, any time
>>> anything in Lucene needs to tap into that posting (query scorers,
>>> filters, merging, applying deletes, etc), it hits this cache.  You
>>> could model it like PulsingCodec, which wraps any other Codec but
>>> handles the low-freq ones itself.  If you do it externally how would
>>> core use of postings hit it?  (Or was that not the intention?)
>>>
>>> I don't understand the filter use-case... the CachingWrapperFilter
>>> already caches per-segment, so that reopen is efficient?  How would an
>>> external cache (built on these hooks) be different?
>>
>> Man you are right - never mind :)
>>
>> simon
>>>
>>> For faster filters we have to apply them like we do deleted docs if
>>> the filter is "random access" (such as being cached), LUCENE-1536 --
>>> flex actually makes this relatively easy now, since the postings API
>>> no longer implicitly filters deleted docs (ie you provide your own
>>> skipDocs) -- but these hooks won't fix that right?
>>>
>>> Mike
>>>
>>> On Sun, Sep 12, 2010 at 3:43 AM, Simon Willnauer
>>> <si...@googlemail.com>  wrote:
>>>>
>>>> Hey Shai,
>>>>
>>>> On Sun, Sep 12, 2010 at 6:51 AM, Shai Erera<se...@gmail.com>  wrote:
>>>>>
>>>>> Hey Simon,
>>>>>
>>>>> You're right that the application can develop a Caching mechanism
>>>>> outside
>>>>> Lucene, and when reopen() is called, if it changed, iterate on the
>>>>> sub-readers and init the Cache w/ the new ones.
>>>>
>>>> Alright, then we are on the same track I guess!
>>>>
>>>>> However, by building something like that inside Lucene, the application
>>>>> will
>>>>> get more native support, and thus better performance, in some cases.
>>>>> For
>>>>> example, consider a field fileType with 10 possible values, and for the
>>>>> sake
>>>>> of simplicity, let's say that the index is divided evenly across them.
>>>>> Your
>>>>> users always add such a term constraint to the query (e.g. they want to
>>>>> get
>>>>> results of fileType:pdf or fileType:odt, and perhaps sometimes both,
>>>>> but not
>>>>> others). You have basically two ways of supporting this:
>>>>> (1) Add such a term to the query / clause to a BooleanQuery w/ an AND
>>>>> relation -- cons is that this term / posting is read for every query.
>>>>
>>>> Oh I wasn't saying that a cache framework would be obsolet and
>>>> shouldn't be part of lucene. My intention was rather to generalize
>>>> this functionality so that we can make the API change more easily and
>>>> at the same time brining the infrastructure you are proposing in
>>>> place.
>>>>
>>>> Regarding you example above, filters are a very good example where
>>>> something like that could help to improve performance and we should
>>>> provide it with lucene core but I would again prefer the least
>>>> intrusive way to do so. If we can make that happen without adding any
>>>> cache agnostic API we should do it. We really should try to sketch out
>>>> a simple API with gives us access to the opened segReaders and see if
>>>> that would be sufficient for our usecases. Specialization will always
>>>> be possible but I doubt that it is needed.
>>>>>
>>>>> (2) Write a Filter which works at the top IR level, that is refreshed
>>>>> whenever the index is refreshed. This is better than (1), however has
>>>>> some
>>>>> disadvantages:
>>>>>
>>>>> (2.1) As Mike already proved (on some issue which I don't remember its
>>>>> subject/number at the moment), if we could get Filter down to the lower
>>>>> level components of Lucene's search, so e.g. it is used as the deleted
>>>>> docs
>>>>> OBS, we can get better performance w/ Filters.
>>>>>
>>>>> (2.2) The Filter is refreshed for the entire IR, and not just the
>>>>> changed
>>>>> segments. Reason is, outside Collector, you have no way of telling
>>>>> IndexSearcher "use Filter F1 for segment S1 and F2 for segment F2".
>>>>> Loading/refreshing the Filter may be expensive, and definitely won't
>>>>> perform
>>>>> well w/ NRT, where by definition you'd like to get small changes
>>>>> searchable
>>>>> very fast.
>>>>
>>>> No doubt you are right about the above. A
>>>> PerSegmentCachingFilterWrapper would be something we can easily do on
>>>> an application level basis with the infrastructure we are talking
>>>> about in place. While I don't exactly know how I feel that this
>>>> particular problem should rather be addressed internally and I'm not
>>>> sure if the high level Cache mechanism is the right way to do it but
>>>> this is just a gut feeling. But when I think about it twice it might
>>>> be way sufficient enough to do it....
>>>>>
>>>>> Therefore I think that if we could provide the necessary hooks in
>>>>> Lucene,
>>>>> let's call it a Cache plug-in for now, we can incrementally improve the
>>>>> search process. I don't want to go too far into the design of a generic
>>>>> plug-ins mechanism, but you're right (again :)) -- we could offer a
>>>>> reopen(PluginProvider) which is entirely not about Cache, and Cache
>>>>> would
>>>>> become one of the Plugins the PluginProvider provides. I just try to
>>>>> learn
>>>>> from past experience -- when the discussion is focused, there's a
>>>>> better
>>>>> chance of getting to a resolution. However if you think that in this
>>>>> case, a
>>>>> more generic API, as PluginProvider, would get us to a resolution
>>>>> faster, I
>>>>> don't mind spend some time to think about it. But for all practical
>>>>> purposes, we should IMO start w/ a Cache plug-in, that is called like
>>>>> that,
>>>>> and if it catches, generify it ...
>>>>
>>>> I absolutely agree the API might be more generic but our current
>>>> use-case / PoC should be a caching. I don't like the name Plugin but
>>>> thats a personal thing since you are not pluggin anything is.
>>>> Something like SubreaderCallback or ReaderVisitor might be more
>>>> accurate but lets argue about the details later. Why not sketching
>>>> something out for the filter problem and follow on from there? The
>>>> more iteration the better and back to your question if that would be
>>>> something which could make it to be committable I would say if it
>>>> works stand alone / not to tightly coupled I would absolutely say yes.
>>>>>
>>>>> Unfortunately, I haven't had enough experience w/ Codecs yet (still on
>>>>> 3x)
>>>>> so I can't comment on how feasible that solution is. I'll take your
>>>>> word for
>>>>> it that it's doable :). But this doesn't give us a 3x solution ... the
>>>>> Caching framework on trunk can be developed w/ Codecs.
>>>>
>>>> I guess nobody really has except of mike and maybe one or two others
>>>> but what I have done so far regarding codecs I would say that is the
>>>> place to solve this particular problem. Maybe even lower than that on
>>>> a Directory level. Anyhow, lets focus on application level caches for
>>>> now. We are not aiming to provide a whole full fledged Cache API but
>>>> the infrastructure to make it easier to build those on a app basis
>>>> which would be a valuable improvement. We should also look at Solr's
>>>> cache implementations and how they could benefit from this efforts
>>>> since Solr uses app-level caching we can learn from API design wise.
>>>>
>>>> simon
>>>>>
>>>>> Shai
>>>>>
>>>>> On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer
>>>>> <si...@googlemail.com>  wrote:
>>>>>>
>>>>>> Hi Shai,
>>>>>>
>>>>>> On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera<se...@gmail.com>  wrote:
>>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> Lucene's Caches have been heavilydiscussed before (e.g., LUCENE-831,
>>>>>>> LUCENE-2133 and LUCENE-2394) and from what I can tell, there have
>>>>>>> been
>>>>>>> many proposals to attack this problem, w/ no developed solution.
>>>>>>
>>>>>> I didn't go through those issues so forgive me if something I bring up
>>>>>> has already been discussed.
>>>>>> I have a couple of question about your proposal - please find them
>>>>>> inline...
>>>>>>
>>>>>>> I'd like to explore a different, IMO much simpler, angle to attach
>>>>>>> this
>>>>>>> problem. Instead of having Lucene manage the Cache itself, we let the
>>>>>>> application manage it, however Lucene will provide the necessary
>>>>>>> hooks
>>>>>>> in IndexReader to allow it. The hooks I have in mind are:
>>>>>>>
>>>>>>> (1) IndexReader current API for TermDocs, TermEnum, TermPositions
>>>>>>> etc.
>>>>>>> --
>>>>>>> already exists.
>>>>>>>
>>>>>>> (2) When reopen() is called, Lucene will take care to call a
>>>>>>> Cache.load(IndexReader), so that the application can pull whatever
>>>>>>> information
>>>>>>> it needs from the passed-in IndexReader.
>>>>>>
>>>>>> Would that do anything else than passing the new reader (if reopened)
>>>>>> to the caches load method? I wonder if this is more than
>>>>>> If(newReader != oldReader)
>>>>>>  Cache.load(newReader)
>>>>>>
>>>>>> If so something like that should be done on a segment reader anyway,
>>>>>> right? From my perspective this isn't more than a callback or visitor
>>>>>> that should walk though the subreaders and called for each reopened
>>>>>> sub-reader. A cache-warming visitor / callback would then be trivial
>>>>>> and the API would be more general.
>>>>>>
>>>>>>
>>>>>>> So to be more concrete on my proposal, I'd like to support caching in
>>>>>>> the following way (and while I've spent some time thinking about it,
>>>>>>> I'm
>>>>>>> sure there are great suggestions to improve it):
>>>>>>>
>>>>>>> * Application provides a CacheFactory to IndexReader.open/reopen,
>>>>>>> which
>>>>>>> exposes some very simple API, such as createCache, or
>>>>>>> initCache(IndexReader) etc. Something which returns a Cache object,
>>>>>>> which does not have very strict/concrete API.
>>>>>>
>>>>>> My first question would be why the reader should know about Cache if
>>>>>> there is no strict / concrete API?
>>>>>> I can follow you with the CacheFactory to create cache objects but why
>>>>>> would the reader have to know / "receive" this object? Maybe this is
>>>>>> answered further down the path but I don't see the reason why the
>>>>>> notion of a "cache" must exist within open/reopen or if that could be
>>>>>> implemented in a more general "cache" - agnostic way.
>>>>>>>
>>>>>>> * IndexReader, most probably at the SegmentReader level uses
>>>>>>> CacheFactory to create a new Cache instance and calls its
>>>>>>> load(IndexReader) method, so that the Cache would initialize itself.
>>>>>>
>>>>>> That is what I was thinking above - yet is that more than a callback
>>>>>> for each reopened or opened segment reader?
>>>>>>
>>>>>>> * The application can use CacheFactory to obtain the Cache object per
>>>>>>> IndexReader (for example, during Collector.setNextReader), or we can
>>>>>>> have IndexReader offer a getCache() method.
>>>>>>
>>>>>> :)  until here the cache is only used by the application itself not by
>>>>>> any Lucene API, right? I can think of many application specific data
>>>>>> that could be useful to be associated with an IR beyond the cacheing
>>>>>> use case - again this could be a more general API solving that
>>>>>> problem.
>>>>>>>
>>>>>>> * One of Cache API would be getCache(TYPE), where TYPE is a String or
>>>>>>> Object, or an interface CacheType w/ no methods, just to be a marker
>>>>>>> one, and the application is free to impl it however it wants. That's
>>>>>>> a
>>>>>>> loose API, I know, but completely at the application hands, which
>>>>>>> makes
>>>>>>> Lucene code simpler.
>>>>>>
>>>>>> I like the idea together with the metadata associating functionality
>>>>>> from above something like public T IndexReader#get(Type<T>  type).
>>>>>> Hmm that looks quiet similar to Attributes, does it?! :) However this
>>>>>> could be done in many ways but again "cache" - agnositc
>>>>>>>
>>>>>>> * We can introduce a TermsCache, TermEnumCache and TermVectorCache to
>>>>>>> provide the user w/ IndexReader-similar API, only more efficient than
>>>>>>> say TermDocs -- something w/ random access to the docs inside,
>>>>>>> perhaps
>>>>>>> even an OpenBitSet. Lucene can take advantage of it if, say, we
>>>>>>> create a
>>>>>>> CachingSegmentReader which makes use of the cache, and checks every
>>>>>>> time
>>>>>>> termDocs() is called if the required Term is cached or not etc. I
>>>>>>> admit
>>>>>>> I may be thinking too much ahead.
>>>>>>
>>>>>> I see what you are trying to do here. I also see how this could be
>>>>>> useful but I guess coming up with a stable APi which serves lots of
>>>>>> applications would be quiet hard. A CachingSegmentReader could be a
>>>>>> very simple decorator which would not require to touch the IR
>>>>>> interface. Something like that could be part of lucene but I'm not
>>>>>> sure if necessarily part of lucene core.
>>>>>>
>>>>>>> That's more or less what I've been thinking. I'm sure there are many
>>>>>>> details to iron out, but I hope I've managed to pass the general
>>>>>>> proposal through to you.
>>>>>>
>>>>>> Absolutely, this is how it works isn't it!
>>>>>>
>>>>>>> What I'm after first, is to allow applications deal w/ postings
>>>>>>> caching
>>>>>>> more
>>>>>>> natively. For example, if you have a posting w/ payloads you'd like
>>>>>>> to
>>>>>>> read into memory, or if you would like a term's TermDocs to be cached
>>>>>>> (to be used as a Filter) etc. -- instead of writing something that
>>>>>>> can
>>>>>>> work at the top IndexReader level, you'd be able to take advantage of
>>>>>>> Lucene internals, i.e. refresh the Cache only for the new segments
>>>>>>> ...
>>>>>>
>>>>>> I wonder if a custom codec would be the right place to implement
>>>>>> caching / mem resident structures for Postings with payloads etc. You
>>>>>> could do that on a higher level too but codec seems to be the way to
>>>>>> go here, right?
>>>>>> To utilize per segment capabilities a callback for (re)opened segment
>>>>>> readers would be sufficient or do I miss something?
>>>>>>
>>>>>> simon
>>>>>>>
>>>>>>> I'm sure that after this will be in place, we can refactor FieldCache
>>>>>>> to
>>>>>>> work w/ that API, perhaps as a Cache specific implementation. But I
>>>>>>> leave that for later.
>>>>>>>
>>>>>>> I'd appreciate your comments. Before I set to implement it, I'd like
>>>>>>> to
>>>>>>> know if the idea has any chances of making it to a commit :).
>>>>>>>
>>>>>>> Shai
>>>>>>>
>>>>>>>



Re: IndexReader Cache - a different angle

Posted by Tim Smith <ts...@attivio.com>.
  I created https://issues.apache.org/jira/browse/LUCENE-2345 some time
ago, proposing pretty much what seems to be discussed here.


  -- Tim

On 09/12/10 10:18, Simon Willnauer wrote:
> On Sun, Sep 12, 2010 at 11:46 AM, Michael McCandless
> <lu...@mikemccandless.com>  wrote:
>> Having hooks to enable an app to manage its own "external, private
>> stuff associated w/ each segment reader" would be useful and it's been
>> asked for in the past.  However, since we've now opened up
>> SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app
>> already do this w/o core API changes?
> The visitor approach would simply be a little more than syntactic
> sugar where only new SubReader instances are passed to the callback.
> You can do the same with the already existing API like
> gatherSubReaders or getSequentialSubReaders. Every API I was talking
> about would just be simplification anyway and would be possible to
> build without changing the core.
>> I know Earwin has built a whole system like this on top of Lucene --
>> Earwin how did you do that...?  Did you make core changes to
>> Lucene...?
>>
>> A custom Codec should be an excellent way to handle the specific use
>> cache (caching certain postings) -- by doing it as a Codec, any time
>> anything in Lucene needs to tap into that posting (query scorers,
>> filters, merging, applying deletes, etc), it hits this cache.  You
>> could model it like PulsingCodec, which wraps any other Codec but
>> handles the low-freq ones itself.  If you do it externally how would
>> core use of postings hit it?  (Or was that not the intention?)
>>
>> I don't understand the filter use-case... the CachingWrapperFilter
>> already caches per-segment, so that reopen is efficient?  How would an
>> external cache (built on these hooks) be different?
> Man you are right - never mind :)
>
> simon
>> For faster filters we have to apply them like we do deleted docs if
>> the filter is "random access" (such as being cached), LUCENE-1536 --
>> flex actually makes this relatively easy now, since the postings API
>> no longer implicitly filters deleted docs (ie you provide your own
>> skipDocs) -- but these hooks won't fix that right?
>>
>> Mike
>>
>> On Sun, Sep 12, 2010 at 3:43 AM, Simon Willnauer
>> <si...@googlemail.com>  wrote:
>>> Hey Shai,
>>>
>>> On Sun, Sep 12, 2010 at 6:51 AM, Shai Erera<se...@gmail.com>  wrote:
>>>> Hey Simon,
>>>>
>>>> You're right that the application can develop a Caching mechanism outside
>>>> Lucene, and when reopen() is called, if it changed, iterate on the
>>>> sub-readers and init the Cache w/ the new ones.
>>> Alright, then we are on the same track I guess!
>>>
>>>> However, by building something like that inside Lucene, the application will
>>>> get more native support, and thus better performance, in some cases. For
>>>> example, consider a field fileType with 10 possible values, and for the sake
>>>> of simplicity, let's say that the index is divided evenly across them. Your
>>>> users always add such a term constraint to the query (e.g. they want to get
>>>> results of fileType:pdf or fileType:odt, and perhaps sometimes both, but not
>>>> others). You have basically two ways of supporting this:
>>>> (1) Add such a term to the query / clause to a BooleanQuery w/ an AND
>>>> relation -- cons is that this term / posting is read for every query.
>>> Oh I wasn't saying that a cache framework would be obsolet and
>>> shouldn't be part of lucene. My intention was rather to generalize
>>> this functionality so that we can make the API change more easily and
>>> at the same time brining the infrastructure you are proposing in
>>> place.
>>>
>>> Regarding you example above, filters are a very good example where
>>> something like that could help to improve performance and we should
>>> provide it with lucene core but I would again prefer the least
>>> intrusive way to do so. If we can make that happen without adding any
>>> cache agnostic API we should do it. We really should try to sketch out
>>> a simple API with gives us access to the opened segReaders and see if
>>> that would be sufficient for our usecases. Specialization will always
>>> be possible but I doubt that it is needed.
>>>> (2) Write a Filter which works at the top IR level, that is refreshed
>>>> whenever the index is refreshed. This is better than (1), however has some
>>>> disadvantages:
>>>>
>>>> (2.1) As Mike already proved (on some issue which I don't remember its
>>>> subject/number at the moment), if we could get Filter down to the lower
>>>> level components of Lucene's search, so e.g. it is used as the deleted docs
>>>> OBS, we can get better performance w/ Filters.
>>>>
>>>> (2.2) The Filter is refreshed for the entire IR, and not just the changed
>>>> segments. Reason is, outside Collector, you have no way of telling
>>>> IndexSearcher "use Filter F1 for segment S1 and F2 for segment F2".
>>>> Loading/refreshing the Filter may be expensive, and definitely won't perform
>>>> well w/ NRT, where by definition you'd like to get small changes searchable
>>>> very fast.
>>> No doubt you are right about the above. A
>>> PerSegmentCachingFilterWrapper would be something we can easily do on
>>> an application level basis with the infrastructure we are talking
>>> about in place. While I don't exactly know how I feel that this
>>> particular problem should rather be addressed internally and I'm not
>>> sure if the high level Cache mechanism is the right way to do it but
>>> this is just a gut feeling. But when I think about it twice it might
>>> be way sufficient enough to do it....
>>>> Therefore I think that if we could provide the necessary hooks in Lucene,
>>>> let's call it a Cache plug-in for now, we can incrementally improve the
>>>> search process. I don't want to go too far into the design of a generic
>>>> plug-ins mechanism, but you're right (again :)) -- we could offer a
>>>> reopen(PluginProvider) which is entirely not about Cache, and Cache would
>>>> become one of the Plugins the PluginProvider provides. I just try to learn
>>>> from past experience -- when the discussion is focused, there's a better
>>>> chance of getting to a resolution. However if you think that in this case, a
>>>> more generic API, as PluginProvider, would get us to a resolution faster, I
>>>> don't mind spend some time to think about it. But for all practical
>>>> purposes, we should IMO start w/ a Cache plug-in, that is called like that,
>>>> and if it catches, generify it ...
>>> I absolutely agree the API might be more generic but our current
>>> use-case / PoC should be a caching. I don't like the name Plugin but
>>> thats a personal thing since you are not pluggin anything is.
>>> Something like SubreaderCallback or ReaderVisitor might be more
>>> accurate but lets argue about the details later. Why not sketching
>>> something out for the filter problem and follow on from there? The
>>> more iteration the better and back to your question if that would be
>>> something which could make it to be committable I would say if it
>>> works stand alone / not to tightly coupled I would absolutely say yes.
>>>> Unfortunately, I haven't had enough experience w/ Codecs yet (still on 3x)
>>>> so I can't comment on how feasible that solution is. I'll take your word for
>>>> it that it's doable :). But this doesn't give us a 3x solution ... the
>>>> Caching framework on trunk can be developed w/ Codecs.
>>> I guess nobody really has except of mike and maybe one or two others
>>> but what I have done so far regarding codecs I would say that is the
>>> place to solve this particular problem. Maybe even lower than that on
>>> a Directory level. Anyhow, lets focus on application level caches for
>>> now. We are not aiming to provide a whole full fledged Cache API but
>>> the infrastructure to make it easier to build those on a app basis
>>> which would be a valuable improvement. We should also look at Solr's
>>> cache implementations and how they could benefit from this efforts
>>> since Solr uses app-level caching we can learn from API design wise.
>>>
>>> simon
>>>> Shai
>>>>
>>>> On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer
>>>> <si...@googlemail.com>  wrote:
>>>>> Hi Shai,
>>>>>
>>>>> On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera<se...@gmail.com>  wrote:
>>>>>> Hi
>>>>>>
>>>>>> Lucene's Caches have been heavilydiscussed before (e.g., LUCENE-831,
>>>>>> LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
>>>>>> many proposals to attack this problem, w/ no developed solution.
>>>>> I didn't go through those issues so forgive me if something I bring up
>>>>> has already been discussed.
>>>>> I have a couple of question about your proposal - please find them
>>>>> inline...
>>>>>
>>>>>> I'd like to explore a different, IMO much simpler, angle to attach this
>>>>>> problem. Instead of having Lucene manage the Cache itself, we let the
>>>>>> application manage it, however Lucene will provide the necessary hooks
>>>>>> in IndexReader to allow it. The hooks I have in mind are:
>>>>>>
>>>>>> (1) IndexReader current API for TermDocs, TermEnum, TermPositions etc.
>>>>>> --
>>>>>> already exists.
>>>>>>
>>>>>> (2) When reopen() is called, Lucene will take care to call a
>>>>>> Cache.load(IndexReader), so that the application can pull whatever
>>>>>> information
>>>>>> it needs from the passed-in IndexReader.
>>>>> Would that do anything else than passing the new reader (if reopened)
>>>>> to the caches load method? I wonder if this is more than
>>>>> If(newReader != oldReader)
>>>>>   Cache.load(newReader)
>>>>>
>>>>> If so something like that should be done on a segment reader anyway,
>>>>> right? From my perspective this isn't more than a callback or visitor
>>>>> that should walk though the subreaders and called for each reopened
>>>>> sub-reader. A cache-warming visitor / callback would then be trivial
>>>>> and the API would be more general.
>>>>>
>>>>>
>>>>>> So to be more concrete on my proposal, I'd like to support caching in
>>>>>> the following way (and while I've spent some time thinking about it, I'm
>>>>>> sure there are great suggestions to improve it):
>>>>>>
>>>>>> * Application provides a CacheFactory to IndexReader.open/reopen, which
>>>>>> exposes some very simple API, such as createCache, or
>>>>>> initCache(IndexReader) etc. Something which returns a Cache object,
>>>>>> which does not have very strict/concrete API.
>>>>> My first question would be why the reader should know about Cache if
>>>>> there is no strict / concrete API?
>>>>> I can follow you with the CacheFactory to create cache objects but why
>>>>> would the reader have to know / "receive" this object? Maybe this is
>>>>> answered further down the path but I don't see the reason why the
>>>>> notion of a "cache" must exist within open/reopen or if that could be
>>>>> implemented in a more general "cache" - agnostic way.
>>>>>> * IndexReader, most probably at the SegmentReader level uses
>>>>>> CacheFactory to create a new Cache instance and calls its
>>>>>> load(IndexReader) method, so that the Cache would initialize itself.
>>>>> That is what I was thinking above - yet is that more than a callback
>>>>> for each reopened or opened segment reader?
>>>>>
>>>>>> * The application can use CacheFactory to obtain the Cache object per
>>>>>> IndexReader (for example, during Collector.setNextReader), or we can
>>>>>> have IndexReader offer a getCache() method.
>>>>> :)  until here the cache is only used by the application itself not by
>>>>> any Lucene API, right? I can think of many application specific data
>>>>> that could be useful to be associated with an IR beyond the cacheing
>>>>> use case - again this could be a more general API solving that
>>>>> problem.
>>>>>> * One of Cache API would be getCache(TYPE), where TYPE is a String or
>>>>>> Object, or an interface CacheType w/ no methods, just to be a marker
>>>>>> one, and the application is free to impl it however it wants. That's a
>>>>>> loose API, I know, but completely at the application hands, which makes
>>>>>> Lucene code simpler.
>>>>> I like the idea together with the metadata associating functionality
>>>>> from above something like public T IndexReader#get(Type<T>  type).
>>>>> Hmm that looks quiet similar to Attributes, does it?! :) However this
>>>>> could be done in many ways but again "cache" - agnositc
>>>>>> * We can introduce a TermsCache, TermEnumCache and TermVectorCache to
>>>>>> provide the user w/ IndexReader-similar API, only more efficient than
>>>>>> say TermDocs -- something w/ random access to the docs inside, perhaps
>>>>>> even an OpenBitSet. Lucene can take advantage of it if, say, we create a
>>>>>> CachingSegmentReader which makes use of the cache, and checks every time
>>>>>> termDocs() is called if the required Term is cached or not etc. I admit
>>>>>> I may be thinking too much ahead.
>>>>> I see what you are trying to do here. I also see how this could be
>>>>> useful but I guess coming up with a stable APi which serves lots of
>>>>> applications would be quiet hard. A CachingSegmentReader could be a
>>>>> very simple decorator which would not require to touch the IR
>>>>> interface. Something like that could be part of lucene but I'm not
>>>>> sure if necessarily part of lucene core.
>>>>>
>>>>>> That's more or less what I've been thinking. I'm sure there are many
>>>>>> details to iron out, but I hope I've managed to pass the general
>>>>>> proposal through to you.
>>>>> Absolutely, this is how it works isn't it!
>>>>>
>>>>>> What I'm after first, is to allow applications deal w/ postings caching
>>>>>> more
>>>>>> natively. For example, if you have a posting w/ payloads you'd like to
>>>>>> read into memory, or if you would like a term's TermDocs to be cached
>>>>>> (to be used as a Filter) etc. -- instead of writing something that can
>>>>>> work at the top IndexReader level, you'd be able to take advantage of
>>>>>> Lucene internals, i.e. refresh the Cache only for the new segments ...
>>>>> I wonder if a custom codec would be the right place to implement
>>>>> caching / mem resident structures for Postings with payloads etc. You
>>>>> could do that on a higher level too but codec seems to be the way to
>>>>> go here, right?
>>>>> To utilize per segment capabilities a callback for (re)opened segment
>>>>> readers would be sufficient or do I miss something?
>>>>>
>>>>> simon
>>>>>> I'm sure that after this will be in place, we can refactor FieldCache to
>>>>>> work w/ that API, perhaps as a Cache specific implementation. But I
>>>>>> leave that for later.
>>>>>>
>>>>>> I'd appreciate your comments. Before I set to implement it, I'd like to
>>>>>> know if the idea has any chances of making it to a commit :).
>>>>>>
>>>>>> Shai
>>>>>>
>>>>>>




Re: IndexReader Cache - a different angle

Posted by Simon Willnauer <si...@googlemail.com>.
On Sun, Sep 12, 2010 at 11:46 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Having hooks to enable an app to manage its own "external, private
> stuff associated w/ each segment reader" would be useful and it's been
> asked for in the past.  However, since we've now opened up
> SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app
> already do this w/o core API changes?

The visitor approach would simply be little more than syntactic sugar,
where only the new SubReader instances are passed to the callback.
You can do the same with the already existing API, like
gatherSubReaders or getSequentialSubReaders. Every API I was talking
about would just be a simplification anyway, and would be possible to
build without changing the core.
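
To make that concrete, a minimal application-side sketch against the existing
3.x API could look like this (SegmentCacheWarmer is an invented callback name,
not a Lucene interface):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.IndexReader;

// Invented application-side callback; not part of Lucene.
interface SegmentCacheWarmer {
  void warm(IndexReader segmentReader) throws IOException;
}

final class ReopenHelper {
  // Reopens the reader and warms only the sub-readers that were not present
  // before, i.e. only new/changed segments; unchanged segments are shared
  // instances between the old and the reopened reader.
  // Assumes a top-level reader whose getSequentialSubReaders() is non-null.
  static IndexReader reopenAndWarm(IndexReader old, SegmentCacheWarmer warmer)
      throws IOException {
    IndexReader reopened = old.reopen();
    if (reopened == old) {
      return old; // index unchanged, nothing to warm
    }
    Set<IndexReader> known = new HashSet<IndexReader>();
    for (IndexReader sub : old.getSequentialSubReaders()) {
      known.add(sub);
    }
    for (IndexReader sub : reopened.getSequentialSubReaders()) {
      if (!known.contains(sub)) {
        warmer.warm(sub);
      }
    }
    return reopened;
  }
}
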
>
> I know Earwin has built a whole system like this on top of Lucene --
> Earwin how did you do that...?  Did you make core changes to
> Lucene...?
>
> A custom Codec should be an excellent way to handle the specific use
> cache (caching certain postings) -- by doing it as a Codec, any time
> anything in Lucene needs to tap into that posting (query scorers,
> filters, merging, applying deletes, etc), it hits this cache.  You
> could model it like PulsingCodec, which wraps any other Codec but
> handles the low-freq ones itself.  If you do it externally how would
> core use of postings hit it?  (Or was that not the intention?)
>
> I don't understand the filter use-case... the CachingWrapperFilter
> already caches per-segment, so that reopen is efficient?  How would an
> external cache (built on these hooks) be different?

Man you are right - never mind :)

simon
>
> For faster filters we have to apply them like we do deleted docs if
> the filter is "random access" (such as being cached), LUCENE-1536 --
> flex actually makes this relatively easy now, since the postings API
> no longer implicitly filters deleted docs (ie you provide your own
> skipDocs) -- but these hooks won't fix that right?
>
> Mike
>
> On Sun, Sep 12, 2010 at 3:43 AM, Simon Willnauer
> <si...@googlemail.com> wrote:
>> Hey Shai,
>>
>> On Sun, Sep 12, 2010 at 6:51 AM, Shai Erera <se...@gmail.com> wrote:
>>> Hey Simon,
>>>
>>> You're right that the application can develop a Caching mechanism outside
>>> Lucene, and when reopen() is called, if it changed, iterate on the
>>> sub-readers and init the Cache w/ the new ones.
>>
>> Alright, then we are on the same track I guess!
>>
>>>
>>> However, by building something like that inside Lucene, the application will
>>> get more native support, and thus better performance, in some cases. For
>>> example, consider a field fileType with 10 possible values, and for the sake
>>> of simplicity, let's say that the index is divided evenly across them. Your
>>> users always add such a term constraint to the query (e.g. they want to get
>>> results of fileType:pdf or fileType:odt, and perhaps sometimes both, but not
>>> others). You have basically two ways of supporting this:
>>> (1) Add such a term to the query / clause to a BooleanQuery w/ an AND
>>> relation -- cons is that this term / posting is read for every query.
>>
>> Oh I wasn't saying that a cache framework would be obsolet and
>> shouldn't be part of lucene. My intention was rather to generalize
>> this functionality so that we can make the API change more easily and
>> at the same time brining the infrastructure you are proposing in
>> place.
>>
>> Regarding you example above, filters are a very good example where
>> something like that could help to improve performance and we should
>> provide it with lucene core but I would again prefer the least
>> intrusive way to do so. If we can make that happen without adding any
>> cache agnostic API we should do it. We really should try to sketch out
>> a simple API with gives us access to the opened segReaders and see if
>> that would be sufficient for our usecases. Specialization will always
>> be possible but I doubt that it is needed.
>>>
>>> (2) Write a Filter which works at the top IR level, that is refreshed
>>> whenever the index is refreshed. This is better than (1), however has some
>>> disadvantages:
>>>
>>> (2.1) As Mike already proved (on some issue which I don't remember its
>>> subject/number at the moment), if we could get Filter down to the lower
>>> level components of Lucene's search, so e.g. it is used as the deleted docs
>>> OBS, we can get better performance w/ Filters.
>>>
>>> (2.2) The Filter is refreshed for the entire IR, and not just the changed
>>> segments. Reason is, outside Collector, you have no way of telling
>>> IndexSearcher "use Filter F1 for segment S1 and F2 for segment F2".
>>> Loading/refreshing the Filter may be expensive, and definitely won't perform
>>> well w/ NRT, where by definition you'd like to get small changes searchable
>>> very fast.
>>
>> No doubt you are right about the above. A
>> PerSegmentCachingFilterWrapper would be something we can easily do on
>> an application level basis with the infrastructure we are talking
>> about in place. While I don't exactly know how I feel that this
>> particular problem should rather be addressed internally and I'm not
>> sure if the high level Cache mechanism is the right way to do it but
>> this is just a gut feeling. But when I think about it twice it might
>> be way sufficient enough to do it....
>>>
>>> Therefore I think that if we could provide the necessary hooks in Lucene,
>>> let's call it a Cache plug-in for now, we can incrementally improve the
>>> search process. I don't want to go too far into the design of a generic
>>> plug-ins mechanism, but you're right (again :)) -- we could offer a
>>> reopen(PluginProvider) which is entirely not about Cache, and Cache would
>>> become one of the Plugins the PluginProvider provides. I just try to learn
>>> from past experience -- when the discussion is focused, there's a better
>>> chance of getting to a resolution. However if you think that in this case, a
>>> more generic API, as PluginProvider, would get us to a resolution faster, I
>>> don't mind spend some time to think about it. But for all practical
>>> purposes, we should IMO start w/ a Cache plug-in, that is called like that,
>>> and if it catches, generify it ...
>> I absolutely agree the API might be more generic but our current
>> use-case / PoC should be a caching. I don't like the name Plugin but
>> thats a personal thing since you are not pluggin anything is.
>> Something like SubreaderCallback or ReaderVisitor might be more
>> accurate but lets argue about the details later. Why not sketching
>> something out for the filter problem and follow on from there? The
>> more iteration the better and back to your question if that would be
>> something which could make it to be committable I would say if it
>> works stand alone / not to tightly coupled I would absolutely say yes.
>>>
>>> Unfortunately, I haven't had enough experience w/ Codecs yet (still on 3x)
>>> so I can't comment on how feasible that solution is. I'll take your word for
>>> it that it's doable :). But this doesn't give us a 3x solution ... the
>>> Caching framework on trunk can be developed w/ Codecs.
>>
>> I guess nobody really has except of mike and maybe one or two others
>> but what I have done so far regarding codecs I would say that is the
>> place to solve this particular problem. Maybe even lower than that on
>> a Directory level. Anyhow, lets focus on application level caches for
>> now. We are not aiming to provide a whole full fledged Cache API but
>> the infrastructure to make it easier to build those on a app basis
>> which would be a valuable improvement. We should also look at Solr's
>> cache implementations and how they could benefit from this efforts
>> since Solr uses app-level caching we can learn from API design wise.
>>
>> simon
>>>
>>> Shai
>>>
>>> On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer
>>> <si...@googlemail.com> wrote:
>>>>
>>>> Hi Shai,
>>>>
>>>> On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera <se...@gmail.com> wrote:
>>>> > Hi
>>>> >
>>>> > Lucene's Caches have been heavilydiscussed before (e.g., LUCENE-831,
>>>> > LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
>>>> > many proposals to attack this problem, w/ no developed solution.
>>>>
>>>> I didn't go through those issues so forgive me if something I bring up
>>>> has already been discussed.
>>>> I have a couple of question about your proposal - please find them
>>>> inline...
>>>>
>>>> >
>>>> > I'd like to explore a different, IMO much simpler, angle to attach this
>>>> > problem. Instead of having Lucene manage the Cache itself, we let the
>>>> > application manage it, however Lucene will provide the necessary hooks
>>>> > in IndexReader to allow it. The hooks I have in mind are:
>>>> >
>>>> > (1) IndexReader current API for TermDocs, TermEnum, TermPositions etc.
>>>> > --
>>>> > already exists.
>>>> >
>>>> > (2) When reopen() is called, Lucene will take care to call a
>>>> > Cache.load(IndexReader), so that the application can pull whatever
>>>> > information
>>>> > it needs from the passed-in IndexReader.
>>>> Would that do anything else than passing the new reader (if reopened)
>>>> to the caches load method? I wonder if this is more than
>>>> If(newReader != oldReader)
>>>>  Cache.load(newReader)
>>>>
>>>> If so something like that should be done on a segment reader anyway,
>>>> right? From my perspective this isn't more than a callback or visitor
>>>> that should walk though the subreaders and called for each reopened
>>>> sub-reader. A cache-warming visitor / callback would then be trivial
>>>> and the API would be more general.
>>>>
>>>>
>>>> > So to be more concrete on my proposal, I'd like to support caching in
>>>> > the following way (and while I've spent some time thinking about it, I'm
>>>> > sure there are great suggestions to improve it):
>>>> >
>>>> > * Application provides a CacheFactory to IndexReader.open/reopen, which
>>>> > exposes some very simple API, such as createCache, or
>>>> > initCache(IndexReader) etc. Something which returns a Cache object,
>>>> > which does not have very strict/concrete API.
>>>>
>>>> My first question would be why the reader should know about Cache if
>>>> there is no strict / concrete API?
>>>> I can follow you with the CacheFactory to create cache objects but why
>>>> would the reader have to know / "receive" this object? Maybe this is
>>>> answered further down the path but I don't see the reason why the
>>>> notion of a "cache" must exist within open/reopen or if that could be
>>>> implemented in a more general "cache" - agnostic way.
>>>> >
>>>> > * IndexReader, most probably at the SegmentReader level uses
>>>> > CacheFactory to create a new Cache instance and calls its
>>>> > load(IndexReader) method, so that the Cache would initialize itself.
>>>> That is what I was thinking above - yet is that more than a callback
>>>> for each reopened or opened segment reader?
>>>>
>>>> >
>>>> > * The application can use CacheFactory to obtain the Cache object per
>>>> > IndexReader (for example, during Collector.setNextReader), or we can
>>>> > have IndexReader offer a getCache() method.
>>>> :)  until here the cache is only used by the application itself not by
>>>> any Lucene API, right? I can think of many application specific data
>>>> that could be useful to be associated with an IR beyond the cacheing
>>>> use case - again this could be a more general API solving that
>>>> problem.
>>>> >
>>>> > * One of Cache API would be getCache(TYPE), where TYPE is a String or
>>>> > Object, or an interface CacheType w/ no methods, just to be a marker
>>>> > one, and the application is free to impl it however it wants. That's a
>>>> > loose API, I know, but completely at the application hands, which makes
>>>> > Lucene code simpler.
>>>> I like the idea together with the metadata associating functionality
>>>> from above something like public T IndexReader#get(Type<T> type).
>>>> Hmm that looks quiet similar to Attributes, does it?! :) However this
>>>> could be done in many ways but again "cache" - agnositc
>>>> >
>>>> > * We can introduce a TermsCache, TermEnumCache and TermVectorCache to
>>>> > provide the user w/ IndexReader-similar API, only more efficient than
>>>> > say TermDocs -- something w/ random access to the docs inside, perhaps
>>>> > even an OpenBitSet. Lucene can take advantage of it if, say, we create a
>>>> > CachingSegmentReader which makes use of the cache, and checks every time
>>>> > termDocs() is called if the required Term is cached or not etc. I admit
>>>> > I may be thinking too much ahead.
>>>> I see what you are trying to do here. I also see how this could be
>>>> useful but I guess coming up with a stable APi which serves lots of
>>>> applications would be quiet hard. A CachingSegmentReader could be a
>>>> very simple decorator which would not require to touch the IR
>>>> interface. Something like that could be part of lucene but I'm not
>>>> sure if necessarily part of lucene core.
>>>>
>>>> > That's more or less what I've been thinking. I'm sure there are many
>>>> > details to iron out, but I hope I've managed to pass the general
>>>> > proposal through to you.
>>>>
>>>> Absolutely, this is how it works isn't it!
>>>>
>>>> >
>>>> > What I'm after first, is to allow applications deal w/ postings caching
>>>> > more
>>>> > natively. For example, if you have a posting w/ payloads you'd like to
>>>> > read into memory, or if you would like a term's TermDocs to be cached
>>>> > (to be used as a Filter) etc. -- instead of writing something that can
>>>> > work at the top IndexReader level, you'd be able to take advantage of
>>>> > Lucene internals, i.e. refresh the Cache only for the new segments ...
>>>>
>>>> I wonder if a custom codec would be the right place to implement
>>>> caching / mem resident structures for Postings with payloads etc. You
>>>> could do that on a higher level too but codec seems to be the way to
>>>> go here, right?
>>>> To utilize per segment capabilities a callback for (re)opened segment
>>>> readers would be sufficient or do I miss something?
>>>>
>>>> simon
>>>> >
>>>> > I'm sure that after this will be in place, we can refactor FieldCache to
>>>> > work w/ that API, perhaps as a Cache specific implementation. But I
>>>> > leave that for later.
>>>> >
>>>> > I'd appreciate your comments. Before I set to implement it, I'd like to
>>>> > know if the idea has any chances of making it to a commit :).
>>>> >
>>>> > Shai
>>>> >
>>>> >
>>>>



Re: IndexReader Cache - a different angle

Posted by Earwin Burrfoot <ea...@gmail.com>.
On Sun, Sep 12, 2010 at 13:46, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Having hooks to enable an app to manage its own "external, private
> stuff associated w/ each segment reader" would be useful and it's been
> asked for in the past.  However, since we've now opened up
> SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app
> already do this w/o core API changes?
>
> I know Earwin has built a whole system like this on top of Lucene --
> Earwin how did you do that...?  Did you make core changes to
> Lucene...?

I did implement generic plugins for a SR/MSR and friends
over 2.9-trunk lucene, and that's a core change indeed.

They didn't handle the IW.getReader case, and I started working
on that (along with a major IR.clone/reopen cleanup - LUCENE-2355),
but got sidetracked.

There's still hope I'll get back to them in the next couple of months :)
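
For context, a purely illustrative sketch of the kind of per-segment plugin
hook being described (invented names; this is not the actual implementation
and not a Lucene API):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;

// Invented names, illustration only.
interface ReaderPlugin {
  void attach(IndexReader segmentReader) throws IOException; // called when a segment reader is (re)opened
  void detach(IndexReader segmentReader);                    // called when it is closed
}

interface PluginProvider {
  ReaderPlugin newPlugin(); // one plugin instance per (re)opened segment reader
}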

> A custom Codec should be an excellent way to handle the specific use
> cache (caching certain postings) -- by doing it as a Codec, any time
> anything in Lucene needs to tap into that posting (query scorers,
> filters, merging, applying deletes, etc), it hits this cache.  You
> could model it like PulsingCodec, which wraps any other Codec but
> handles the low-freq ones itself.  If you do it externally how would
> core use of postings hit it?  (Or was that not the intention?)
>
> I don't understand the filter use-case... the CachingWrapperFilter
> already caches per-segment, so that reopen is efficient?  How would an
> external cache (built on these hooks) be different?
>
> For faster filters we have to apply them like we do deleted docs if
> the filter is "random access" (such as being cached), LUCENE-1536 --
> flex actually makes this relatively easy now, since the postings API
> no longer implicitly filters deleted docs (ie you provide your own
> skipDocs) -- but these hooks won't fix that right?
>
> Mike
>
> On Sun, Sep 12, 2010 at 3:43 AM, Simon Willnauer
> <si...@googlemail.com> wrote:
>> Hey Shai,
>>
>> On Sun, Sep 12, 2010 at 6:51 AM, Shai Erera <se...@gmail.com> wrote:
>>> Hey Simon,
>>>
>>> You're right that the application can develop a Caching mechanism outside
>>> Lucene, and when reopen() is called, if it changed, iterate on the
>>> sub-readers and init the Cache w/ the new ones.
>>
>> Alright, then we are on the same track I guess!
>>
>>>
>>> However, by building something like that inside Lucene, the application will
>>> get more native support, and thus better performance, in some cases. For
>>> example, consider a field fileType with 10 possible values, and for the sake
>>> of simplicity, let's say that the index is divided evenly across them. Your
>>> users always add such a term constraint to the query (e.g. they want to get
>>> results of fileType:pdf or fileType:odt, and perhaps sometimes both, but not
>>> others). You have basically two ways of supporting this:
>>> (1) Add such a term to the query / clause to a BooleanQuery w/ an AND
>>> relation -- cons is that this term / posting is read for every query.
>>
>> Oh I wasn't saying that a cache framework would be obsolet and
>> shouldn't be part of lucene. My intention was rather to generalize
>> this functionality so that we can make the API change more easily and
>> at the same time brining the infrastructure you are proposing in
>> place.
>>
>> Regarding you example above, filters are a very good example where
>> something like that could help to improve performance and we should
>> provide it with lucene core but I would again prefer the least
>> intrusive way to do so. If we can make that happen without adding any
>> cache agnostic API we should do it. We really should try to sketch out
>> a simple API with gives us access to the opened segReaders and see if
>> that would be sufficient for our usecases. Specialization will always
>> be possible but I doubt that it is needed.
>>>
>>> (2) Write a Filter which works at the top IR level, that is refreshed
>>> whenever the index is refreshed. This is better than (1), however has some
>>> disadvantages:
>>>
>>> (2.1) As Mike already proved (on some issue which I don't remember its
>>> subject/number at the moment), if we could get Filter down to the lower
>>> level components of Lucene's search, so e.g. it is used as the deleted docs
>>> OBS, we can get better performance w/ Filters.
>>>
>>> (2.2) The Filter is refreshed for the entire IR, and not just the changed
>>> segments. Reason is, outside Collector, you have no way of telling
>>> IndexSearcher "use Filter F1 for segment S1 and F2 for segment F2".
>>> Loading/refreshing the Filter may be expensive, and definitely won't perform
>>> well w/ NRT, where by definition you'd like to get small changes searchable
>>> very fast.
>>
>> No doubt you are right about the above. A
>> PerSegmentCachingFilterWrapper would be something we can easily do on
>> an application level basis with the infrastructure we are talking
>> about in place. While I don't exactly know how I feel that this
>> particular problem should rather be addressed internally and I'm not
>> sure if the high level Cache mechanism is the right way to do it but
>> this is just a gut feeling. But when I think about it twice it might
>> be way sufficient enough to do it....
>>>
>>> Therefore I think that if we could provide the necessary hooks in Lucene,
>>> let's call it a Cache plug-in for now, we can incrementally improve the
>>> search process. I don't want to go too far into the design of a generic
>>> plug-ins mechanism, but you're right (again :)) -- we could offer a
>>> reopen(PluginProvider) which is entirely not about Cache, and Cache would
>>> become one of the Plugins the PluginProvider provides. I just try to learn
>>> from past experience -- when the discussion is focused, there's a better
>>> chance of getting to a resolution. However if you think that in this case, a
>>> more generic API, as PluginProvider, would get us to a resolution faster, I
>>> don't mind spend some time to think about it. But for all practical
>>> purposes, we should IMO start w/ a Cache plug-in, that is called like that,
>>> and if it catches, generify it ...
>> I absolutely agree the API might be more generic but our current
>> use-case / PoC should be a caching. I don't like the name Plugin but
>> thats a personal thing since you are not pluggin anything is.
>> Something like SubreaderCallback or ReaderVisitor might be more
>> accurate but lets argue about the details later. Why not sketching
>> something out for the filter problem and follow on from there? The
>> more iteration the better and back to your question if that would be
>> something which could make it to be committable I would say if it
>> works stand alone / not to tightly coupled I would absolutely say yes.
>>>
>>> Unfortunately, I haven't had enough experience w/ Codecs yet (still on 3x)
>>> so I can't comment on how feasible that solution is. I'll take your word for
>>> it that it's doable :). But this doesn't give us a 3x solution ... the
>>> Caching framework on trunk can be developed w/ Codecs.
>>
>> I guess nobody really has except of mike and maybe one or two others
>> but what I have done so far regarding codecs I would say that is the
>> place to solve this particular problem. Maybe even lower than that on
>> a Directory level. Anyhow, lets focus on application level caches for
>> now. We are not aiming to provide a whole full fledged Cache API but
>> the infrastructure to make it easier to build those on a app basis
>> which would be a valuable improvement. We should also look at Solr's
>> cache implementations and how they could benefit from this efforts
>> since Solr uses app-level caching we can learn from API design wise.
>>
>> simon
>>>
>>> Shai
>>>
>>> On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer
>>> <si...@googlemail.com> wrote:
>>>>
>>>> Hi Shai,
>>>>
>>>> On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera <se...@gmail.com> wrote:
>>>> > Hi
>>>> >
>>>> > Lucene's Caches have been heavilydiscussed before (e.g., LUCENE-831,
>>>> > LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
>>>> > many proposals to attack this problem, w/ no developed solution.
>>>>
>>>> I didn't go through those issues so forgive me if something I bring up
>>>> has already been discussed.
>>>> I have a couple of question about your proposal - please find them
>>>> inline...
>>>>
>>>> >
>>>> > I'd like to explore a different, IMO much simpler, angle to attach this
>>>> > problem. Instead of having Lucene manage the Cache itself, we let the
>>>> > application manage it, however Lucene will provide the necessary hooks
>>>> > in IndexReader to allow it. The hooks I have in mind are:
>>>> >
>>>> > (1) IndexReader current API for TermDocs, TermEnum, TermPositions etc.
>>>> > --
>>>> > already exists.
>>>> >
>>>> > (2) When reopen() is called, Lucene will take care to call a
>>>> > Cache.load(IndexReader), so that the application can pull whatever
>>>> > information
>>>> > it needs from the passed-in IndexReader.
>>>> Would that do anything else than passing the new reader (if reopened)
>>>> to the caches load method? I wonder if this is more than
>>>> If(newReader != oldReader)
>>>>  Cache.load(newReader)
>>>>
>>>> If so something like that should be done on a segment reader anyway,
>>>> right? From my perspective this isn't more than a callback or visitor
>>>> that should walk though the subreaders and called for each reopened
>>>> sub-reader. A cache-warming visitor / callback would then be trivial
>>>> and the API would be more general.
>>>>
>>>>
>>>> > So to be more concrete on my proposal, I'd like to support caching in
>>>> > the following way (and while I've spent some time thinking about it, I'm
>>>> > sure there are great suggestions to improve it):
>>>> >
>>>> > * Application provides a CacheFactory to IndexReader.open/reopen, which
>>>> > exposes some very simple API, such as createCache, or
>>>> > initCache(IndexReader) etc. Something which returns a Cache object,
>>>> > which does not have very strict/concrete API.
>>>>
>>>> My first question would be why the reader should know about Cache if
>>>> there is no strict / concrete API?
>>>> I can follow you with the CacheFactory to create cache objects but why
>>>> would the reader have to know / "receive" this object? Maybe this is
>>>> answered further down the path but I don't see the reason why the
>>>> notion of a "cache" must exist within open/reopen or if that could be
>>>> implemented in a more general "cache" - agnostic way.
>>>> >
>>>> > * IndexReader, most probably at the SegmentReader level uses
>>>> > CacheFactory to create a new Cache instance and calls its
>>>> > load(IndexReader) method, so that the Cache would initialize itself.
>>>> That is what I was thinking above - yet is that more than a callback
>>>> for each reopened or opened segment reader?
>>>>
>>>> >
>>>> > * The application can use CacheFactory to obtain the Cache object per
>>>> > IndexReader (for example, during Collector.setNextReader), or we can
>>>> > have IndexReader offer a getCache() method.
>>>> :)  until here the cache is only used by the application itself not by
>>>> any Lucene API, right? I can think of many application specific data
>>>> that could be useful to be associated with an IR beyond the cacheing
>>>> use case - again this could be a more general API solving that
>>>> problem.
>>>> >
>>>> > * One of Cache API would be getCache(TYPE), where TYPE is a String or
>>>> > Object, or an interface CacheType w/ no methods, just to be a marker
>>>> > one, and the application is free to impl it however it wants. That's a
>>>> > loose API, I know, but completely at the application hands, which makes
>>>> > Lucene code simpler.
>>>> I like the idea together with the metadata associating functionality
>>>> from above something like public T IndexReader#get(Type<T> type).
>>>> Hmm that looks quiet similar to Attributes, does it?! :) However this
>>>> could be done in many ways but again "cache" - agnositc
>>>> >
>>>> > * We can introduce a TermsCache, TermEnumCache and TermVectorCache to
>>>> > provide the user w/ IndexReader-similar API, only more efficient than
>>>> > say TermDocs -- something w/ random access to the docs inside, perhaps
>>>> > even an OpenBitSet. Lucene can take advantage of it if, say, we create a
>>>> > CachingSegmentReader which makes use of the cache, and checks every time
>>>> > termDocs() is called if the required Term is cached or not etc. I admit
>>>> > I may be thinking too much ahead.
>>>> I see what you are trying to do here. I also see how this could be
>>>> useful but I guess coming up with a stable APi which serves lots of
>>>> applications would be quiet hard. A CachingSegmentReader could be a
>>>> very simple decorator which would not require to touch the IR
>>>> interface. Something like that could be part of lucene but I'm not
>>>> sure if necessarily part of lucene core.
>>>>
>>>> > That's more or less what I've been thinking. I'm sure there are many
>>>> > details to iron out, but I hope I've managed to pass the general
>>>> > proposal through to you.
>>>>
>>>> Absolutely, this is how it works isn't it!
>>>>
>>>> >
>>>> > What I'm after first, is to allow applications deal w/ postings caching
>>>> > more
>>>> > natively. For example, if you have a posting w/ payloads you'd like to
>>>> > read into memory, or if you would like a term's TermDocs to be cached
>>>> > (to be used as a Filter) etc. -- instead of writing something that can
>>>> > work at the top IndexReader level, you'd be able to take advantage of
>>>> > Lucene internals, i.e. refresh the Cache only for the new segments ...
>>>>
>>>> I wonder if a custom codec would be the right place to implement
>>>> caching / mem resident structures for Postings with payloads etc. You
>>>> could do that on a higher level too but codec seems to be the way to
>>>> go here, right?
>>>> To utilize per segment capabilities a callback for (re)opened segment
>>>> readers would be sufficient or do I miss something?
>>>>
>>>> simon
>>>> >
>>>> > I'm sure that after this will be in place, we can refactor FieldCache to
>>>> > work w/ that API, perhaps as a Cache specific implementation. But I
>>>> > leave that for later.
>>>> >
>>>> > I'd appreciate your comments. Before I set to implement it, I'd like to
>>>> > know if the idea has any chances of making it to a commit :).
>>>> >
>>>> > Shai
>>>> >
>>>> >
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785



Re: IndexReader Cache - a different angle

Posted by Michael McCandless <lu...@mikemccandless.com>.
Having hooks to enable an app to manage its own "external, private
stuff associated w/ each segment reader" would be useful and it's been
asked for in the past.  However, since we've now opened up
SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app
already do this w/o core API changes?

I know Earwin has built a whole system like this on top of Lucene --
Earwin how did you do that...?  Did you make core changes to
Lucene...?

A custom Codec should be an excellent way to handle the specific use
case (caching certain postings) -- by doing it as a Codec, any time
anything in Lucene needs to tap into that posting (query scorers,
filters, merging, applying deletes, etc.), it hits this cache.  You
could model it like PulsingCodec, which wraps any other Codec but
handles the low-frequency terms itself.  If you do it externally, how
would core use of postings hit it?  (Or was that not the intention?)
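
As a rough illustration of that decorator pattern -- using hypothetical
PostingsSource/Postings interfaces as stand-ins, not the actual flex
Codec API -- a caching wrapper could look roughly like this:

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  // Hypothetical minimal postings abstraction (NOT the real Codec API);
  // only the wrapping pattern matters here.
  interface Postings {
    int nextDoc(); // returns -1 when exhausted
  }

  interface PostingsSource {
    Postings postings(String field, String term);
  }

  // Wraps any PostingsSource and serves selected terms from memory,
  // analogous to how PulsingCodec wraps another Codec but handles the
  // low-frequency terms itself.
  class CachingPostingsSource implements PostingsSource {
    private final PostingsSource delegate;
    private final Map<String, int[]> cache =
        new ConcurrentHashMap<String, int[]>();

    CachingPostingsSource(PostingsSource delegate) {
      this.delegate = delegate;
    }

    public Postings postings(String field, String term) {
      final int[] docs = cache.get(field + ":" + term);
      if (docs == null) {
        return delegate.postings(field, term); // not cached: fall through
      }
      return new Postings() {                  // cached: replay from memory
        private int upto = -1;
        public int nextDoc() {
          return ++upto < docs.length ? docs[upto] : -1;
        }
      };
    }

    // Called at load/reopen time to pin a term's postings in memory.
    void cacheTerm(String field, String term, int[] docs) {
      cache.put(field + ":" + term, docs);
    }
  }

The win of doing it at this level is exactly the point above: every
core consumer of postings goes through the same path and therefore hits
the cache.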

I don't understand the filter use-case... the CachingWrapperFilter
already caches per-segment, so that reopen is efficient?  How would an
external cache (built on these hooks) be different?
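
For reference, the per-segment behavior referred to here boils down to
keying the cached DocIdSet on the segment reader that IndexSearcher
passes into getDocIdSet; a simplified sketch (3.x-style Filter API
assumed, no eviction or error handling):

  import java.io.IOException;
  import java.util.Map;
  import java.util.WeakHashMap;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.DocIdSet;
  import org.apache.lucene.search.Filter;

  // Because IndexSearcher calls getDocIdSet once per segment reader,
  // keying the cache by reader means a reopen only computes doc sets
  // for the new segments.
  public class PerSegmentCachingFilter extends Filter {
    private final Filter delegate;
    private final Map<IndexReader, DocIdSet> cache =
        new WeakHashMap<IndexReader, DocIdSet>();

    public PerSegmentCachingFilter(Filter delegate) {
      this.delegate = delegate;
    }

    @Override
    public synchronized DocIdSet getDocIdSet(IndexReader reader)
        throws IOException {
      DocIdSet cached = cache.get(reader);
      if (cached == null) {
        cached = delegate.getDocIdSet(reader); // only for unseen segments
        cache.put(reader, cached);
      }
      return cached;
    }
  }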

For faster filters we have to apply them like we do deleted docs if
the filter is "random access" (such as being cached), LUCENE-1536 --
flex actually makes this relatively easy now, since the postings API
no longer implicitly filters deleted docs (ie you provide your own
skipDocs) -- but these hooks won't fix that right?
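
In other words, a "random access" filter would be intersected with the
deletion bits before a doc ever reaches the scorer. A tiny sketch of
that intersection, using a hypothetical bits interface rather than the
real flex skipDocs type:

  // Hypothetical random-access bits; the flex postings API accepts
  // something similar ("skipDocs") when enumerating docs.
  interface RandomAccessBits {
    boolean get(int docId);
  }

  // Combine deleted docs and a cached filter into a single skip set, so
  // filtered-out docs are skipped as cheaply as deleted docs are.
  final class SkipDeletedOrFilteredOut implements RandomAccessBits {
    private final RandomAccessBits deleted;      // true = doc is deleted
    private final RandomAccessBits passesFilter; // true = doc matches filter

    SkipDeletedOrFilteredOut(RandomAccessBits deleted,
                             RandomAccessBits passesFilter) {
      this.deleted = deleted;
      this.passesFilter = passesFilter;
    }

    public boolean get(int docId) {
      // "skip" semantics: true means the doc must NOT be scored
      return deleted.get(docId) || !passesFilter.get(docId);
    }
  }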

Mike

On Sun, Sep 12, 2010 at 3:43 AM, Simon Willnauer
<si...@googlemail.com> wrote:
> Hey Shai,
>
> On Sun, Sep 12, 2010 at 6:51 AM, Shai Erera <se...@gmail.com> wrote:
>> Hey Simon,
>>
>> You're right that the application can develop a Caching mechanism outside
>> Lucene, and when reopen() is called, if it changed, iterate on the
>> sub-readers and init the Cache w/ the new ones.
>
> Alright, then we are on the same track I guess!
>
>>
>> However, by building something like that inside Lucene, the application will
>> get more native support, and thus better performance, in some cases. For
>> example, consider a field fileType with 10 possible values, and for the sake
>> of simplicity, let's say that the index is divided evenly across them. Your
>> users always add such a term constraint to the query (e.g. they want to get
>> results of fileType:pdf or fileType:odt, and perhaps sometimes both, but not
>> others). You have basically two ways of supporting this:
>> (1) Add such a term to the query / clause to a BooleanQuery w/ an AND
>> relation -- cons is that this term / posting is read for every query.
>
> Oh I wasn't saying that a cache framework would be obsolet and
> shouldn't be part of lucene. My intention was rather to generalize
> this functionality so that we can make the API change more easily and
> at the same time brining the infrastructure you are proposing in
> place.
>
> Regarding you example above, filters are a very good example where
> something like that could help to improve performance and we should
> provide it with lucene core but I would again prefer the least
> intrusive way to do so. If we can make that happen without adding any
> cache agnostic API we should do it. We really should try to sketch out
> a simple API with gives us access to the opened segReaders and see if
> that would be sufficient for our usecases. Specialization will always
> be possible but I doubt that it is needed.
>>
>> (2) Write a Filter which works at the top IR level, that is refreshed
>> whenever the index is refreshed. This is better than (1), however has some
>> disadvantages:
>>
>> (2.1) As Mike already proved (on some issue which I don't remember its
>> subject/number at the moment), if we could get Filter down to the lower
>> level components of Lucene's search, so e.g. it is used as the deleted docs
>> OBS, we can get better performance w/ Filters.
>>
>> (2.2) The Filter is refreshed for the entire IR, and not just the changed
>> segments. Reason is, outside Collector, you have no way of telling
>> IndexSearcher "use Filter F1 for segment S1 and F2 for segment F2".
>> Loading/refreshing the Filter may be expensive, and definitely won't perform
>> well w/ NRT, where by definition you'd like to get small changes searchable
>> very fast.
>
> No doubt you are right about the above. A
> PerSegmentCachingFilterWrapper would be something we can easily do on
> an application level basis with the infrastructure we are talking
> about in place. While I don't exactly know how I feel that this
> particular problem should rather be addressed internally and I'm not
> sure if the high level Cache mechanism is the right way to do it but
> this is just a gut feeling. But when I think about it twice it might
> be way sufficient enough to do it....
>>
>> Therefore I think that if we could provide the necessary hooks in Lucene,
>> let's call it a Cache plug-in for now, we can incrementally improve the
>> search process. I don't want to go too far into the design of a generic
>> plug-ins mechanism, but you're right (again :)) -- we could offer a
>> reopen(PluginProvider) which is entirely not about Cache, and Cache would
>> become one of the Plugins the PluginProvider provides. I just try to learn
>> from past experience -- when the discussion is focused, there's a better
>> chance of getting to a resolution. However if you think that in this case, a
>> more generic API, as PluginProvider, would get us to a resolution faster, I
>> don't mind spend some time to think about it. But for all practical
>> purposes, we should IMO start w/ a Cache plug-in, that is called like that,
>> and if it catches, generify it ...
> I absolutely agree the API might be more generic but our current
> use-case / PoC should be a caching. I don't like the name Plugin but
> thats a personal thing since you are not pluggin anything is.
> Something like SubreaderCallback or ReaderVisitor might be more
> accurate but lets argue about the details later. Why not sketching
> something out for the filter problem and follow on from there? The
> more iteration the better and back to your question if that would be
> something which could make it to be committable I would say if it
> works stand alone / not to tightly coupled I would absolutely say yes.
>>
>> Unfortunately, I haven't had enough experience w/ Codecs yet (still on 3x)
>> so I can't comment on how feasible that solution is. I'll take your word for
>> it that it's doable :). But this doesn't give us a 3x solution ... the
>> Caching framework on trunk can be developed w/ Codecs.
>
> I guess nobody really has except of mike and maybe one or two others
> but what I have done so far regarding codecs I would say that is the
> place to solve this particular problem. Maybe even lower than that on
> a Directory level. Anyhow, lets focus on application level caches for
> now. We are not aiming to provide a whole full fledged Cache API but
> the infrastructure to make it easier to build those on a app basis
> which would be a valuable improvement. We should also look at Solr's
> cache implementations and how they could benefit from this efforts
> since Solr uses app-level caching we can learn from API design wise.
>
> simon
>>
>> Shai
>>
>> On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer
>> <si...@googlemail.com> wrote:
>>>
>>> Hi Shai,
>>>
>>> On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera <se...@gmail.com> wrote:
>>> > Hi
>>> >
>>> > Lucene's Caches have been heavilydiscussed before (e.g., LUCENE-831,
>>> > LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
>>> > many proposals to attack this problem, w/ no developed solution.
>>>
>>> I didn't go through those issues so forgive me if something I bring up
>>> has already been discussed.
>>> I have a couple of question about your proposal - please find them
>>> inline...
>>>
>>> >
>>> > I'd like to explore a different, IMO much simpler, angle to attach this
>>> > problem. Instead of having Lucene manage the Cache itself, we let the
>>> > application manage it, however Lucene will provide the necessary hooks
>>> > in IndexReader to allow it. The hooks I have in mind are:
>>> >
>>> > (1) IndexReader current API for TermDocs, TermEnum, TermPositions etc.
>>> > --
>>> > already exists.
>>> >
>>> > (2) When reopen() is called, Lucene will take care to call a
>>> > Cache.load(IndexReader), so that the application can pull whatever
>>> > information
>>> > it needs from the passed-in IndexReader.
>>> Would that do anything else than passing the new reader (if reopened)
>>> to the caches load method? I wonder if this is more than
>>> If(newReader != oldReader)
>>>  Cache.load(newReader)
>>>
>>> If so something like that should be done on a segment reader anyway,
>>> right? From my perspective this isn't more than a callback or visitor
>>> that should walk though the subreaders and called for each reopened
>>> sub-reader. A cache-warming visitor / callback would then be trivial
>>> and the API would be more general.
>>>
>>>
>>> > So to be more concrete on my proposal, I'd like to support caching in
>>> > the following way (and while I've spent some time thinking about it, I'm
>>> > sure there are great suggestions to improve it):
>>> >
>>> > * Application provides a CacheFactory to IndexReader.open/reopen, which
>>> > exposes some very simple API, such as createCache, or
>>> > initCache(IndexReader) etc. Something which returns a Cache object,
>>> > which does not have very strict/concrete API.
>>>
>>> My first question would be why the reader should know about Cache if
>>> there is no strict / concrete API?
>>> I can follow you with the CacheFactory to create cache objects but why
>>> would the reader have to know / "receive" this object? Maybe this is
>>> answered further down the path but I don't see the reason why the
>>> notion of a "cache" must exist within open/reopen or if that could be
>>> implemented in a more general "cache" - agnostic way.
>>> >
>>> > * IndexReader, most probably at the SegmentReader level uses
>>> > CacheFactory to create a new Cache instance and calls its
>>> > load(IndexReader) method, so that the Cache would initialize itself.
>>> That is what I was thinking above - yet is that more than a callback
>>> for each reopened or opened segment reader?
>>>
>>> >
>>> > * The application can use CacheFactory to obtain the Cache object per
>>> > IndexReader (for example, during Collector.setNextReader), or we can
>>> > have IndexReader offer a getCache() method.
>>> :)  until here the cache is only used by the application itself not by
>>> any Lucene API, right? I can think of many application specific data
>>> that could be useful to be associated with an IR beyond the cacheing
>>> use case - again this could be a more general API solving that
>>> problem.
>>> >
>>> > * One of Cache API would be getCache(TYPE), where TYPE is a String or
>>> > Object, or an interface CacheType w/ no methods, just to be a marker
>>> > one, and the application is free to impl it however it wants. That's a
>>> > loose API, I know, but completely at the application hands, which makes
>>> > Lucene code simpler.
>>> I like the idea together with the metadata associating functionality
>>> from above something like public T IndexReader#get(Type<T> type).
>>> Hmm that looks quiet similar to Attributes, does it?! :) However this
>>> could be done in many ways but again "cache" - agnositc
>>> >
>>> > * We can introduce a TermsCache, TermEnumCache and TermVectorCache to
>>> > provide the user w/ IndexReader-similar API, only more efficient than
>>> > say TermDocs -- something w/ random access to the docs inside, perhaps
>>> > even an OpenBitSet. Lucene can take advantage of it if, say, we create a
>>> > CachingSegmentReader which makes use of the cache, and checks every time
>>> > termDocs() is called if the required Term is cached or not etc. I admit
>>> > I may be thinking too much ahead.
>>> I see what you are trying to do here. I also see how this could be
>>> useful but I guess coming up with a stable APi which serves lots of
>>> applications would be quiet hard. A CachingSegmentReader could be a
>>> very simple decorator which would not require to touch the IR
>>> interface. Something like that could be part of lucene but I'm not
>>> sure if necessarily part of lucene core.
>>>
>>> > That's more or less what I've been thinking. I'm sure there are many
>>> > details to iron out, but I hope I've managed to pass the general
>>> > proposal through to you.
>>>
>>> Absolutely, this is how it works isn't it!
>>>
>>> >
>>> > What I'm after first, is to allow applications deal w/ postings caching
>>> > more
>>> > natively. For example, if you have a posting w/ payloads you'd like to
>>> > read into memory, or if you would like a term's TermDocs to be cached
>>> > (to be used as a Filter) etc. -- instead of writing something that can
>>> > work at the top IndexReader level, you'd be able to take advantage of
>>> > Lucene internals, i.e. refresh the Cache only for the new segments ...
>>>
>>> I wonder if a custom codec would be the right place to implement
>>> caching / mem resident structures for Postings with payloads etc. You
>>> could do that on a higher level too but codec seems to be the way to
>>> go here, right?
>>> To utilize per segment capabilities a callback for (re)opened segment
>>> readers would be sufficient or do I miss something?
>>>
>>> simon
>>> >
>>> > I'm sure that after this will be in place, we can refactor FieldCache to
>>> > work w/ that API, perhaps as a Cache specific implementation. But I
>>> > leave that for later.
>>> >
>>> > I'd appreciate your comments. Before I set to implement it, I'd like to
>>> > know if the idea has any chances of making it to a commit :).
>>> >
>>> > Shai
>>> >
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>



Re: IndexReader Cache - a different angle

Posted by Simon Willnauer <si...@googlemail.com>.
Hey Shai,

On Sun, Sep 12, 2010 at 6:51 AM, Shai Erera <se...@gmail.com> wrote:
> Hey Simon,
>
> You're right that the application can develop a Caching mechanism outside
> Lucene, and when reopen() is called, if it changed, iterate on the
> sub-readers and init the Cache w/ the new ones.

Alright, then we are on the same track I guess!

>
> However, by building something like that inside Lucene, the application will
> get more native support, and thus better performance, in some cases. For
> example, consider a field fileType with 10 possible values, and for the sake
> of simplicity, let's say that the index is divided evenly across them. Your
> users always add such a term constraint to the query (e.g. they want to get
> results of fileType:pdf or fileType:odt, and perhaps sometimes both, but not
> others). You have basically two ways of supporting this:
> (1) Add such a term to the query / clause to a BooleanQuery w/ an AND
> relation -- cons is that this term / posting is read for every query.

Oh, I wasn't saying that a cache framework would be obsolete and
shouldn't be part of Lucene. My intention was rather to generalize
this functionality so that we can make the API change more easily
while at the same time bringing the infrastructure you are proposing
into place.

Regarding your example above, filters are a very good example where
something like that could help to improve performance, and we should
provide it with Lucene core, but I would again prefer the least
intrusive way to do so. If we can make that happen without adding any
cache-specific API, we should do it. We really should try to sketch out
a simple API which gives us access to the opened SegmentReaders and see
if that would be sufficient for our use cases. Specialization will
always be possible, but I doubt that it is needed.
>
> (2) Write a Filter which works at the top IR level, that is refreshed
> whenever the index is refreshed. This is better than (1), however has some
> disadvantages:
>
> (2.1) As Mike already proved (on some issue which I don't remember its
> subject/number at the moment), if we could get Filter down to the lower
> level components of Lucene's search, so e.g. it is used as the deleted docs
> OBS, we can get better performance w/ Filters.
>
> (2.2) The Filter is refreshed for the entire IR, and not just the changed
> segments. Reason is, outside Collector, you have no way of telling
> IndexSearcher "use Filter F1 for segment S1 and F2 for segment F2".
> Loading/refreshing the Filter may be expensive, and definitely won't perform
> well w/ NRT, where by definition you'd like to get small changes searchable
> very fast.

No doubt you are right about the above. A
PerSegmentCachingFilterWrapper would be something we could easily build
at the application level with the infrastructure we are talking about
in place. I don't exactly know how I feel about it -- my gut feeling is
that this particular problem should rather be addressed internally, and
I'm not sure the high-level Cache mechanism is the right way to do it.
But when I think about it twice, the app-level approach might well be
sufficient....
>
> Therefore I think that if we could provide the necessary hooks in Lucene,
> let's call it a Cache plug-in for now, we can incrementally improve the
> search process. I don't want to go too far into the design of a generic
> plug-ins mechanism, but you're right (again :)) -- we could offer a
> reopen(PluginProvider) which is entirely not about Cache, and Cache would
> become one of the Plugins the PluginProvider provides. I just try to learn
> from past experience -- when the discussion is focused, there's a better
> chance of getting to a resolution. However if you think that in this case, a
> more generic API, as PluginProvider, would get us to a resolution faster, I
> don't mind spend some time to think about it. But for all practical
> purposes, we should IMO start w/ a Cache plug-in, that is called like that,
> and if it catches, generify it ...
I absolutely agree the API might be more generic, but our current
use case / PoC should be caching. I don't like the name Plugin, but
that's a personal thing, since you are not plugging anything in.
Something like SubReaderCallback or ReaderVisitor might be more
accurate, but let's argue about the details later. Why not sketch
something out for the filter problem and follow on from there? The
more iterations the better. And back to your question whether this
could make it to a commit: if it works stand-alone / is not too
tightly coupled, I would absolutely say yes.
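
A rough sketch of what such a hook might look like (the names are
placeholders for this discussion, not an existing Lucene API):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;

  // Placeholder callback; (re)open would invoke it once for every
  // sub-reader that was newly opened, so warming work (caches, filters,
  // ...) happens only for changed segments.
  interface SubReaderCallback {
    void subReaderOpened(IndexReader segmentReader) throws IOException;
  }

  // Example of what an application might register to warm its caches.
  class CacheWarmingCallback implements SubReaderCallback {
    public void subReaderOpened(IndexReader segmentReader)
        throws IOException {
      // e.g. pre-load a term's postings, build a per-segment bitset, ...
      System.out.println("warming segment with "
          + segmentReader.maxDoc() + " docs");
    }
  }
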
>
> Unfortunately, I haven't had enough experience w/ Codecs yet (still on 3x)
> so I can't comment on how feasible that solution is. I'll take your word for
> it that it's doable :). But this doesn't give us a 3x solution ... the
> Caching framework on trunk can be developed w/ Codecs.

I guess nobody really has, except Mike and maybe one or two others,
but from what I have done so far regarding codecs I would say that is
the place to solve this particular problem. Maybe even lower than that,
on the Directory level. Anyhow, let's focus on application-level caches
for now. We are not aiming to provide a whole full-fledged Cache API,
but the infrastructure to make it easier to build those on an app
basis, which would be a valuable improvement. We should also look at
Solr's cache implementations and how they could benefit from this
effort; since Solr uses app-level caching, we can learn from it
API-design-wise.

simon
>
> Shai
>
> On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer
> <si...@googlemail.com> wrote:
>>
>> Hi Shai,
>>
>> On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera <se...@gmail.com> wrote:
>> > Hi
>> >
>> > Lucene's Caches have been heavilydiscussed before (e.g., LUCENE-831,
>> > LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
>> > many proposals to attack this problem, w/ no developed solution.
>>
>> I didn't go through those issues so forgive me if something I bring up
>> has already been discussed.
>> I have a couple of question about your proposal - please find them
>> inline...
>>
>> >
>> > I'd like to explore a different, IMO much simpler, angle to attach this
>> > problem. Instead of having Lucene manage the Cache itself, we let the
>> > application manage it, however Lucene will provide the necessary hooks
>> > in IndexReader to allow it. The hooks I have in mind are:
>> >
>> > (1) IndexReader current API for TermDocs, TermEnum, TermPositions etc.
>> > --
>> > already exists.
>> >
>> > (2) When reopen() is called, Lucene will take care to call a
>> > Cache.load(IndexReader), so that the application can pull whatever
>> > information
>> > it needs from the passed-in IndexReader.
>> Would that do anything else than passing the new reader (if reopened)
>> to the caches load method? I wonder if this is more than
>> If(newReader != oldReader)
>>  Cache.load(newReader)
>>
>> If so something like that should be done on a segment reader anyway,
>> right? From my perspective this isn't more than a callback or visitor
>> that should walk though the subreaders and called for each reopened
>> sub-reader. A cache-warming visitor / callback would then be trivial
>> and the API would be more general.
>>
>>
>> > So to be more concrete on my proposal, I'd like to support caching in
>> > the following way (and while I've spent some time thinking about it, I'm
>> > sure there are great suggestions to improve it):
>> >
>> > * Application provides a CacheFactory to IndexReader.open/reopen, which
>> > exposes some very simple API, such as createCache, or
>> > initCache(IndexReader) etc. Something which returns a Cache object,
>> > which does not have very strict/concrete API.
>>
>> My first question would be why the reader should know about Cache if
>> there is no strict / concrete API?
>> I can follow you with the CacheFactory to create cache objects but why
>> would the reader have to know / "receive" this object? Maybe this is
>> answered further down the path but I don't see the reason why the
>> notion of a "cache" must exist within open/reopen or if that could be
>> implemented in a more general "cache" - agnostic way.
>> >
>> > * IndexReader, most probably at the SegmentReader level uses
>> > CacheFactory to create a new Cache instance and calls its
>> > load(IndexReader) method, so that the Cache would initialize itself.
>> That is what I was thinking above - yet is that more than a callback
>> for each reopened or opened segment reader?
>>
>> >
>> > * The application can use CacheFactory to obtain the Cache object per
>> > IndexReader (for example, during Collector.setNextReader), or we can
>> > have IndexReader offer a getCache() method.
>> :)  until here the cache is only used by the application itself not by
>> any Lucene API, right? I can think of many application specific data
>> that could be useful to be associated with an IR beyond the cacheing
>> use case - again this could be a more general API solving that
>> problem.
>> >
>> > * One of Cache API would be getCache(TYPE), where TYPE is a String or
>> > Object, or an interface CacheType w/ no methods, just to be a marker
>> > one, and the application is free to impl it however it wants. That's a
>> > loose API, I know, but completely at the application hands, which makes
>> > Lucene code simpler.
>> I like the idea together with the metadata associating functionality
>> from above something like public T IndexReader#get(Type<T> type).
>> Hmm that looks quiet similar to Attributes, does it?! :) However this
>> could be done in many ways but again "cache" - agnositc
>> >
>> > * We can introduce a TermsCache, TermEnumCache and TermVectorCache to
>> > provide the user w/ IndexReader-similar API, only more efficient than
>> > say TermDocs -- something w/ random access to the docs inside, perhaps
>> > even an OpenBitSet. Lucene can take advantage of it if, say, we create a
>> > CachingSegmentReader which makes use of the cache, and checks every time
>> > termDocs() is called if the required Term is cached or not etc. I admit
>> > I may be thinking too much ahead.
>> I see what you are trying to do here. I also see how this could be
>> useful but I guess coming up with a stable APi which serves lots of
>> applications would be quiet hard. A CachingSegmentReader could be a
>> very simple decorator which would not require to touch the IR
>> interface. Something like that could be part of lucene but I'm not
>> sure if necessarily part of lucene core.
>>
>> > That's more or less what I've been thinking. I'm sure there are many
>> > details to iron out, but I hope I've managed to pass the general
>> > proposal through to you.
>>
>> Absolutely, this is how it works isn't it!
>>
>> >
>> > What I'm after first, is to allow applications deal w/ postings caching
>> > more
>> > natively. For example, if you have a posting w/ payloads you'd like to
>> > read into memory, or if you would like a term's TermDocs to be cached
>> > (to be used as a Filter) etc. -- instead of writing something that can
>> > work at the top IndexReader level, you'd be able to take advantage of
>> > Lucene internals, i.e. refresh the Cache only for the new segments ...
>>
>> I wonder if a custom codec would be the right place to implement
>> caching / mem resident structures for Postings with payloads etc. You
>> could do that on a higher level too but codec seems to be the way to
>> go here, right?
>> To utilize per segment capabilities a callback for (re)opened segment
>> readers would be sufficient or do I miss something?
>>
>> simon
>> >
>> > I'm sure that after this will be in place, we can refactor FieldCache to
>> > work w/ that API, perhaps as a Cache specific implementation. But I
>> > leave that for later.
>> >
>> > I'd appreciate your comments. Before I set to implement it, I'd like to
>> > know if the idea has any chances of making it to a commit :).
>> >
>> > Shai
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>



Re: IndexReader Cache - a different angle

Posted by Shai Erera <se...@gmail.com>.
Hey Simon,

You're right that the application can develop a Caching mechanism outside
Lucene, and when reopen() is called, if the reader changed, iterate over
the sub-readers and init the Cache w/ the new ones.

However, by building something like that inside Lucene, the application will
get more native support, and thus better performance, in some cases. For
example, consider a field fileType with 10 possible values, and for the sake
of simplicity, let's say that the index is divided evenly across them. Your
users always add such a term constraint to the query (e.g. they want to get
results of fileType:pdf or fileType:odt, and perhaps sometimes both, but not
others). You have basically two ways of supporting this:
(1) Add such a term clause to the query (a BooleanQuery w/ an AND
relation) -- the con is that this term's posting is read for every query.

(2) Write a Filter which works at the top IR level and is refreshed
whenever the index is refreshed. This is better than (1), however it has
some disadvantages:

(2.1) As Mike already proved (on some issue whose subject/number I don't
remember at the moment), if we could push Filter down to the lower-level
components of Lucene's search, so that e.g. it is applied like the deleted
docs OBS, we would get better performance w/ Filters.

(2.2) The Filter is refreshed for the entire IR, and not just the changed
segments. The reason is that, outside Collector, you have no way of telling
IndexSearcher "use Filter F1 for segment S1 and F2 for segment S2".
Loading/refreshing the Filter may be expensive, and definitely won't perform
well w/ NRT, where by definition you'd like to get small changes searchable
very fast.
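
To make (2) concrete at the segment level: the fileType constraint can
be materialized as an OpenBitSet per segment reader, built once from
TermDocs and kept until that segment goes away (3.x API; the field and
value are just this example's):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermDocs;
  import org.apache.lucene.util.OpenBitSet;

  // Build a random-access doc set for fileType:pdf on one segment reader.
  final class FileTypeFilterLoader {
    static OpenBitSet load(IndexReader segmentReader) throws IOException {
      OpenBitSet bits = new OpenBitSet(segmentReader.maxDoc());
      TermDocs td = segmentReader.termDocs(new Term("fileType", "pdf"));
      try {
        while (td.next()) {
          bits.set(td.doc());
        }
      } finally {
        td.close();
      }
      return bits;
    }
  }

With hooks like the ones proposed here, only the bitsets of newly
(re)opened segments would have to be rebuilt.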

Therefore I think that if we could provide the necessary hooks in Lucene,
let's call it a Cache plug-in for now, we can incrementally improve the
search process. I don't want to go too far into the design of a generic
plug-ins mechanism, but you're right (again :)) -- we could offer a
reopen(PluginProvider) which is entirely not about Cache, and Cache would
become one of the Plugins the PluginProvider provides. I'm just trying to
learn from past experience -- when the discussion is focused, there's a
better chance of getting to a resolution. However, if you think that in
this case a more generic API, such as PluginProvider, would get us to a
resolution faster, I don't mind spending some time to think about it. But
for all practical purposes, we should IMO start w/ a Cache plug-in, called
exactly that, and if it catches on, generify it ...

Unfortunately, I haven't had enough experience w/ Codecs yet (still on 3x)
so I can't comment on how feasible that solution is. I'll take your word for
it that it's doable :). But this doesn't give us a 3x solution ... the
Caching framework on trunk can be developed w/ Codecs.

Shai

On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer <
simon.willnauer@googlemail.com> wrote:

> Hi Shai,
>
> On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera <se...@gmail.com> wrote:
> > Hi
> >
> > Lucene's Caches have been heavilydiscussed before (e.g., LUCENE-831,
> > LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
> > many proposals to attack this problem, w/ no developed solution.
>
> I didn't go through those issues so forgive me if something I bring up
> has already been discussed.
> I have a couple of question about your proposal - please find them
> inline...
>
> >
> > I'd like to explore a different, IMO much simpler, angle to attach this
> > problem. Instead of having Lucene manage the Cache itself, we let the
> > application manage it, however Lucene will provide the necessary hooks
> > in IndexReader to allow it. The hooks I have in mind are:
> >
> > (1) IndexReader current API for TermDocs, TermEnum, TermPositions etc. --
> > already exists.
> >
> > (2) When reopen() is called, Lucene will take care to call a
> > Cache.load(IndexReader), so that the application can pull whatever
> > information
> > it needs from the passed-in IndexReader.
> Would that do anything else than passing the new reader (if reopened)
> to the caches load method? I wonder if this is more than
> If(newReader != oldReader)
>  Cache.load(newReader)
>
> If so something like that should be done on a segment reader anyway,
> right? From my perspective this isn't more than a callback or visitor
> that should walk though the subreaders and called for each reopened
> sub-reader. A cache-warming visitor / callback would then be trivial
> and the API would be more general.
>
>
> > So to be more concrete on my proposal, I'd like to support caching in
> > the following way (and while I've spent some time thinking about it, I'm
> > sure there are great suggestions to improve it):
> >
> > * Application provides a CacheFactory to IndexReader.open/reopen, which
> > exposes some very simple API, such as createCache, or
> > initCache(IndexReader) etc. Something which returns a Cache object,
> > which does not have very strict/concrete API.
>
> My first question would be why the reader should know about Cache if
> there is no strict / concrete API?
> I can follow you with the CacheFactory to create cache objects but why
> would the reader have to know / "receive" this object? Maybe this is
> answered further down the path but I don't see the reason why the
> notion of a "cache" must exist within open/reopen or if that could be
> implemented in a more general "cache" - agnostic way.
> >
> > * IndexReader, most probably at the SegmentReader level uses
> > CacheFactory to create a new Cache instance and calls its
> > load(IndexReader) method, so that the Cache would initialize itself.
> That is what I was thinking above - yet is that more than a callback
> for each reopened or opened segment reader?
>
> >
> > * The application can use CacheFactory to obtain the Cache object per
> > IndexReader (for example, during Collector.setNextReader), or we can
> > have IndexReader offer a getCache() method.
> :)  until here the cache is only used by the application itself not by
> any Lucene API, right? I can think of many application specific data
> that could be useful to be associated with an IR beyond the cacheing
> use case - again this could be a more general API solving that
> problem.
> >
> > * One of Cache API would be getCache(TYPE), where TYPE is a String or
> > Object, or an interface CacheType w/ no methods, just to be a marker
> > one, and the application is free to impl it however it wants. That's a
> > loose API, I know, but completely at the application hands, which makes
> > Lucene code simpler.
> I like the idea together with the metadata associating functionality
> from above something like public T IndexReader#get(Type<T> type).
> Hmm that looks quiet similar to Attributes, does it?! :) However this
> could be done in many ways but again "cache" - agnositc
> >
> > * We can introduce a TermsCache, TermEnumCache and TermVectorCache to
> > provide the user w/ IndexReader-similar API, only more efficient than
> > say TermDocs -- something w/ random access to the docs inside, perhaps
> > even an OpenBitSet. Lucene can take advantage of it if, say, we create a
> > CachingSegmentReader which makes use of the cache, and checks every time
> > termDocs() is called if the required Term is cached or not etc. I admit
> > I may be thinking too much ahead.
> I see what you are trying to do here. I also see how this could be
> useful but I guess coming up with a stable APi which serves lots of
> applications would be quiet hard. A CachingSegmentReader could be a
> very simple decorator which would not require to touch the IR
> interface. Something like that could be part of lucene but I'm not
> sure if necessarily part of lucene core.
>
> > That's more or less what I've been thinking. I'm sure there are many
> > details to iron out, but I hope I've managed to pass the general
> > proposal through to you.
>
> Absolutely, this is how it works isn't it!
>
> >
> > What I'm after first, is to allow applications deal w/ postings caching
> more
> > natively. For example, if you have a posting w/ payloads you'd like to
> > read into memory, or if you would like a term's TermDocs to be cached
> > (to be used as a Filter) etc. -- instead of writing something that can
> > work at the top IndexReader level, you'd be able to take advantage of
> > Lucene internals, i.e. refresh the Cache only for the new segments ...
>
> I wonder if a custom codec would be the right place to implement
> caching / mem resident structures for Postings with payloads etc. You
> could do that on a higher level too but codec seems to be the way to
> go here, right?
> To utilize per segment capabilities a callback for (re)opened segment
> readers would be sufficient or do I miss something?
>
> simon
> >
> > I'm sure that after this will be in place, we can refactor FieldCache to
> > work w/ that API, perhaps as a Cache specific implementation. But I
> > leave that for later.
> >
> > I'd appreciate your comments. Before I set to implement it, I'd like to
> > know if the idea has any chances of making it to a commit :).
> >
> > Shai
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: IndexReader Cache - a different angle

Posted by Simon Willnauer <si...@googlemail.com>.
Hi Shai,

On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera <se...@gmail.com> wrote:
> Hi
>
> Lucene's Caches have been heavilydiscussed before (e.g., LUCENE-831,
> LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
> many proposals to attack this problem, w/ no developed solution.

I didn't go through those issues, so forgive me if something I bring up
has already been discussed.
I have a couple of questions about your proposal - please find them inline...

>
> I'd like to explore a different, IMO much simpler, angle to attach this
> problem. Instead of having Lucene manage the Cache itself, we let the
> application manage it, however Lucene will provide the necessary hooks
> in IndexReader to allow it. The hooks I have in mind are:
>
> (1) IndexReader current API for TermDocs, TermEnum, TermPositions etc. --
> already exists.
>
> (2) When reopen() is called, Lucene will take care to call a
> Cache.load(IndexReader), so that the application can pull whatever
> information
> it needs from the passed-in IndexReader.
Would that do anything more than passing the new reader (if reopened)
to the cache's load method? I wonder if this is more than
if (newReader != oldReader)
   Cache.load(newReader)

If so, something like that should be done on a segment reader anyway,
right? From my perspective this isn't more than a callback or visitor
that walks through the sub-readers and is called for each reopened
sub-reader. A cache-warming visitor / callback would then be trivial
and the API would be more general.
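
For what it's worth, an application can already approximate that walk
today without core changes; a sketch against the 3.x API (warm() is the
application's own cache-loading code):

  import java.io.IOException;
  import java.util.IdentityHashMap;
  import java.util.Map;
  import org.apache.lucene.index.IndexReader;

  // After reopen(), warm only the sub-readers we have not seen before.
  // (A weak map would avoid pinning closed readers; kept simple here.)
  class PerSegmentWarmer {
    private final Map<IndexReader, Boolean> seen =
        new IdentityHashMap<IndexReader, Boolean>();

    IndexReader reopenAndWarm(IndexReader oldReader) throws IOException {
      IndexReader newReader = oldReader.reopen();
      if (newReader != oldReader) {
        for (IndexReader sub : newReader.getSequentialSubReaders()) {
          if (!seen.containsKey(sub)) {
            warm(sub);               // application-defined cache loading
            seen.put(sub, Boolean.TRUE);
          }
        }
      }
      return newReader;
    }

    void warm(IndexReader segmentReader) throws IOException {
      // e.g. load a term's TermDocs into an OpenBitSet, read payloads, ...
    }
  }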


> So to be more concrete on my proposal, I'd like to support caching in
> the following way (and while I've spent some time thinking about it, I'm
> sure there are great suggestions to improve it):
>
> * Application provides a CacheFactory to IndexReader.open/reopen, which
> exposes some very simple API, such as createCache, or
> initCache(IndexReader) etc. Something which returns a Cache object,
> which does not have very strict/concrete API.

My first question would be why the reader should know about Cache if
there is no strict / concrete API.
I can follow you with the CacheFactory to create cache objects, but why
would the reader have to know about / "receive" this object? Maybe this
is answered further down, but I don't see why the notion of a "cache"
must exist within open/reopen, or whether it could be implemented in a
more general, cache-agnostic way.
>
> * IndexReader, most probably at the SegmentReader level uses
> CacheFactory to create a new Cache instance and calls its
> load(IndexReader) method, so that the Cache would initialize itself.
That is what I was thinking above - yet is that more than a callback
for each reopened or opened segment reader?

>
> * The application can use CacheFactory to obtain the Cache object per
> IndexReader (for example, during Collector.setNextReader), or we can
> have IndexReader offer a getCache() method.
:)  Up to here the cache is only used by the application itself, not by
any Lucene API, right? I can think of many kinds of application-specific
data that could usefully be associated with an IR beyond the caching
use case - again, this could be a more general API solving that
problem.
>
> * One of Cache API would be getCache(TYPE), where TYPE is a String or
> Object, or an interface CacheType w/ no methods, just to be a marker
> one, and the application is free to impl it however it wants. That's a
> loose API, I know, but completely at the application hands, which makes
> Lucene code simpler.
I like the idea, together with the metadata-association functionality
from above -- something like public T IndexReader#get(Type<T> type).
Hmm, that looks quite similar to Attributes, doesn't it?! :) However,
this could be done in many ways, but again cache-agnostic.
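
A minimal sketch of that kind of typed lookup, independent of any cache
(Type<T> is just an illustrative key class here, not an existing Lucene
type):

  import java.util.IdentityHashMap;
  import java.util.Map;

  // Illustrative typed key; the instance identity is the key and the
  // type parameter carries the value type.
  final class Type<T> {
  }

  // Sketch of a per-reader, cache-agnostic metadata holder as discussed.
  final class ReaderMetadata {
    private final Map<Type<?>, Object> values =
        new IdentityHashMap<Type<?>, Object>();

    public <T> void put(Type<T> type, T value) {
      values.put(type, value);
    }

    @SuppressWarnings("unchecked")
    public <T> T get(Type<T> type) {
      return (T) values.get(type);
    }
  }

An application could then keep, say, a Type<OpenBitSet> key per cached
filter and attach one such holder to every segment reader.
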
>
> * We can introduce a TermsCache, TermEnumCache and TermVectorCache to
> provide the user w/ IndexReader-similar API, only more efficient than
> say TermDocs -- something w/ random access to the docs inside, perhaps
> even an OpenBitSet. Lucene can take advantage of it if, say, we create a
> CachingSegmentReader which makes use of the cache, and checks every time
> termDocs() is called if the required Term is cached or not etc. I admit
> I may be thinking too much ahead.
I see what you are trying to do here. I also see how this could be
useful, but I guess coming up with a stable API which serves lots of
applications would be quite hard. A CachingSegmentReader could be a
very simple decorator which would not require touching the IR
interface. Something like that could be part of Lucene, but I'm not
sure it necessarily belongs in Lucene core.

> That's more or less what I've been thinking. I'm sure there are many
> details to iron out, but I hope I've managed to pass the general
> proposal through to you.

Absolutely, this is how it works, isn't it!

>
> What I'm after first, is to allow applications deal w/ postings caching more
> natively. For example, if you have a posting w/ payloads you'd like to
> read into memory, or if you would like a term's TermDocs to be cached
> (to be used as a Filter) etc. -- instead of writing something that can
> work at the top IndexReader level, you'd be able to take advantage of
> Lucene internals, i.e. refresh the Cache only for the new segments ...

I wonder if a custom codec would be the right place to implement
caching / memory-resident structures for Postings with payloads etc. You
could do that on a higher level too, but a codec seems to be the way to
go here, right?
To utilize per-segment capabilities, a callback for (re)opened segment
readers would be sufficient -- or am I missing something?

simon
>
> I'm sure that after this will be in place, we can refactor FieldCache to
> work w/ that API, perhaps as a Cache specific implementation. But I
> leave that for later.
>
> I'd appreciate your comments. Before I set to implement it, I'd like to
> know if the idea has any chances of making it to a commit :).
>
> Shai
>
>
