Posted to java-user@lucene.apache.org by Britske <gb...@gmail.com> on 2008/04/17 14:14:36 UTC

Does LUCENE-831) "Complete overhaul of FieldCache API" provide fieldcache offloading to disk?

I've seen some recent activity on LUCENE-831 "Complete overhaul of FieldCache
API" and read that it should apply cleanly to trunk (I haven't tried
yet).

What I'd like to know from people involved is whether this patch incorporates
offloading of the fieldcache to disk, or whether this hasn't yet been taken
into account. As far as I can follow it, this was one of the initial intentions.

Thanks,
Britske
-- 
View this message in context: http://www.nabble.com/Does-LUCENE-831%29-%22Complete-overhaul-of-FieldCache-API%22-provide-fieldcache-offloading-to-disk--tp16743559p16743559.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Does LUCENE-831) "Complete overhaul of FieldCache API" provide fieldcache offloading to disk?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Ahh, whoops, sorry: I hadn't looked at the latest patch on LUCENE-831
yet.  Thanks!  That's great.

Mike

Mark Miller wrote:
> Right...that is what the latest patch I put up does (Hoss basically
> stubbed it all out to be ready for this).
>
> Each SegmentReader has its own cache. Each MultiReader can have its own
> cache as well (in the case that you want a primitive array), but if you
> can take an ObjectArray object instead, the MultiReader returns an
> ObjectArray that distributes doc i requests to the appropriate
> ObjectArray owned by a SegmentReader (the ObjectArray returned from the
> SegmentReader's cache). This was done at your suggestion :) And man is it
> fast in some cases.
>
> It works amazingly well if you reopen a lot and have a lax merge
> factor...just as you say, often you only have to load a tiny new segment
> when reloading the field cache...on average it's tons faster to reopen.
>
> - Mark
>
> On Thu, 2008-04-17 at 14:54 -0400, Michael McCandless wrote:
>> Mark Miller wrote:
>>
>>> I think your 2 readers question is interesting and I will certainly
>>> think about it. Right now though, each IndexReader instance holds its own
>>> cache. I'll have to dig back into the code and see about possibly keying
>>> on the directory or something?
>>
>> I think, with how IndexReader.reopen() now works, we should switch to
>> somehow having the FieldCache "attached" to each SegmentReader instead
>> of stored globally keyed by the top MultiSegmentReader.
>>
>> This way if we do a reopen and say the only change to the index was 10
>> added docs then the only new FieldCache that gets created is that
>> length 10 array (because only that SegmentReader will be new).
>>
>> But then the FieldCache is just starting to feel a lot like
>> column-stride fields (LUCENE-1231).
>>
>> Mike


Re: Does LUCENE-831) "Complete overhaul of FieldCache API" provide fieldcache offloading to disk?

Posted by Mark Miller <ma...@gmail.com>.
Right...that is what the latest patch I put up does (Hoss basically
stubbed it all out to be ready for this).

Each SegmentReader has its own cache. Each MultiReader can have its own
cache as well (in the case that you want a primitive array), but if you
can take an ObjectArray object instead, the MultiReader returns an
ObjectArray that distributes doc i requests to the appropriate
ObjectArray owned by a SegmentReader (the ObjectArray returned from the
SegmentReader's cache). This was done at your suggestion :) And man is it
fast in some cases.

It works amazingly well if you reopen a lot and have a lax merge
factor...just as you say, often you only have to load a tiny new segment
when reloading the field cache...on average it's tons faster to reopen.
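
To make the "distributes doc i requests" part concrete, here is a minimal
sketch of the idea. It is not code from the patch: the ObjectArray name is
borrowed from the description above, and its methods, SegmentArray and
CompositeArray are all assumptions made for illustration. The top-level
array holds no values of its own; it only maps a global doc id to the
owning segment and a local doc id.

```java
import java.util.Arrays;

/** Hypothetical read-only view of per-document values (methods are assumed). */
interface ObjectArray {
    int size();
    int get(int doc);
}

/** Values for a single segment, held in a plain int[]. */
final class SegmentArray implements ObjectArray {
    private final int[] values;
    SegmentArray(int... values) { this.values = values; }
    public int size() { return values.length; }
    public int get(int doc) { return values[doc]; }
}

/** Top-level view that forwards each lookup to the owning segment's array. */
final class CompositeArray implements ObjectArray {
    private final ObjectArray[] segments;
    private final int[] starts; // starts[i] = global id of the first doc in segment i
    private final int size;

    CompositeArray(ObjectArray... segments) {
        this.segments = segments;
        this.starts = new int[segments.length];
        int base = 0;
        for (int i = 0; i < segments.length; i++) {
            starts[i] = base;
            base += segments[i].size();
        }
        this.size = base;
    }

    public int size() { return size; }

    public int get(int doc) {
        if (doc < 0 || doc >= size) {
            throw new IndexOutOfBoundsException("doc " + doc);
        }
        int idx = Arrays.binarySearch(starts, doc);
        if (idx < 0) {
            idx = -idx - 2; // insertion point minus one = owning segment
        }
        // Skip over empty segments that share the same start offset.
        while (idx + 1 < starts.length && starts[idx + 1] <= doc) {
            idx++;
        }
        return segments[idx].get(doc - starts[idx]);
    }
}
```

With segments of 3 and 2 docs, global doc 3 resolves to local doc 0 of the
second segment. The lookup costs a binary search per access, which is the
price paid for not copying everything into one big primitive array.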

- Mark

On Thu, 2008-04-17 at 14:54 -0400, Michael McCandless wrote:
> Mark Miller wrote:
> 
> > I think your 2 readers question is interesting and I will certainly
> > think about it. Right now though, each IndexReader instance holds its own
> > cache. I'll have to dig back into the code and see about possibly keying
> > on the directory or something?
> 
> I think, with how IndexReader.reopen() now works, we should switch to
> somehow having the FieldCache "attached" to each SegmentReader instead
> of stored globally keyed by the top MultiSegmentReader.
> 
> This way if we do a reopen and say the only change to the index was 10
> added docs then the only new FieldCache that gets created is that
> length 10 array (because only that SegmentReader will be new).
> 
> But then the FieldCache is just starting to feel a lot like
> column-stride fields (LUCENE-1231).
> 
> Mike


Re: Does LUCENE-831) "Complete overhaul of FieldCache API" provide fieldcache offloading to disk?

Posted by Michael Busch <bu...@gmail.com>.
Michael McCandless wrote:
> 
> OK so in this approach, a CSF is an "on disk" format, while the 
> FieldCache represents loading all (or maybe eventually subsets as 
> controlled by a cache policy) into a memory cache.  And since they both 
> implement FieldValueSource you can swap either in when you need it.
> 

Yes, exactly.

-Michael


> This sounds good!
> 
> Mike
> 
> Michael Busch wrote:
>> Chris Hostetter wrote:
>>> : But then the FieldCache is just starting to feel a lot like
>>> : column-stride fields (LUCENE-1231).
>>>
>>> that's what i've been thinking ... my goal with LUCENE-831 was to 
>>> make it easier to manage FieldCache and hopefully the norms[] as well, 
>>> particularly in the case of reopen ... but with column-stride fields 
>>> the need for both of those might go away completely
>>>
>>
>> (moved to java-dev, java-user cc'd)
>>
>> My goal is not to get rid of the FieldCache by adding column-stride 
>> fields (CSF), but instead to make them the default source for the 
>> FieldCache.
>>
>> We should introduce an interface, named maybe FieldValueSource, that 
>> the new FieldCache implements, and also the CSF API. That has some 
>> advantages:
>> - Norms can be stored as CSF, and can be accessed using the 
>> FieldValueSource API. Then we can easily add an option to IndexReader 
>> whether to cache norms in memory (i. e. the new FieldCache) or not. 
>> When users have huge indexes on 32bit machines, where the norms would 
>> consume too much memory, they can disable caching them, of course 
>> search performance will suffer (but that's better than OutOfMemoryErrors)
>> - The function queries can use the FieldValueSource interface to 
>> retrieve the values (allowing us to get rid of function/ValueSource).
>> - Any consumer of the FieldValueSource does not have to care about 
>> whether or not values are cached and how. If performance is too slow 
>> and memory permits, caching can be enabled very easily.
>> - We will still support loading the fieldcache from the dictionary for 
>> backwards compatibility, but we should think about deprecating this 
>> and eventually get rid of it. We probably shouldn't add an 
>> implementation of FieldValueSource that reads from the dictionary, 
>> because performance would be terrible in the non-cached mode.
>>
>> -Michael


Re: Does LUCENE-831) "Complete overhaul of FieldCache API" provide fieldcache offloading to disk?

Posted by Michael McCandless <lu...@mikemccandless.com>.
OK so in this approach, a CSF is an "on disk" format, while the
FieldCache represents loading all (or maybe eventually subsets as
controlled by a cache policy) into a memory cache.  And since they
both implement FieldValueSource you can swap either in when you need it.

This sounds good!

Mike

Michael Busch wrote:
> Chris Hostetter wrote:
>> : But then the FieldCache is just starting to feel a lot like
>> : column-stride fields (LUCENE-1231).
>>
>> that's what i've been thinking ... my goal with LUCENE-831 was to
>> make it easier to manage FieldCache and hopefully the norms[] as
>> well, particularly in the case of reopen ... but with column-stride
>> fields the need for both of those might go away completely
>>
>
> (moved to java-dev, java-user cc'd)
>
> My goal is not to get rid of the FieldCache by adding
> column-stride fields (CSF), but instead to make them the default source
> for the FieldCache.
>
> We should introduce an interface, named maybe FieldValueSource,  
> that the new FieldCache implements, and also the CSF API. That has  
> some advantages:
> - Norms can be stored as CSF, and can be accessed using the  
> FieldValueSource API. Then we can easily add an option to  
> IndexReader whether to cache norms in memory (i. e. the new  
> FieldCache) or not. When users have huge indexes on 32bit machines,  
> where the norms would consume too much memory, they can disable  
> caching them, of course search performance will suffer (but that's  
> better than OutOfMemoryErrors)
> - The function queries can use the FieldValueSource interface to  
> retrieve the values (allowing us to get rid of function/ValueSource).
> - Any consumer of the FieldValueSource does not have to care about  
> whether or not values are cached and how. If performance is too  
> slow and memory permits, caching can be enabled very easily.
> - We will still support loading the fieldcache from the dictionary  
> for backwards compatibility, but we should think about deprecating  
> this and eventually get rid of it. We probably shouldn't add an  
> implementation of FieldValueSource that reads from the dictionary,  
> because performance would be terrible in the non-cached mode.
>
> -Michael


Re: Does LUCENE-831) "Complete overhaul of FieldCache API" provide fieldcache offloading to disk?

Posted by Michael Busch <bu...@gmail.com>.
Chris Hostetter wrote:
> : But then the FieldCache is just starting to feel a lot like column-stride
> : fields
> : (LUCENE-1231).
> 
> that's what i've been thinking ... my goal with LUCENE-831 was to make it 
> easier to manage FieldCache and hopefully the norms[] as well, particularly 
> in the case of reopen ... but with column-stride fields the need for both 
> of those might go away completely
>

(moved to java-dev, java-user cc'd)

My goal is not to get rid of the FieldCache by adding column-stride 
fields (CSF), but instead to make them the default source for the 
FieldCache.

We should introduce an interface, maybe named FieldValueSource, that both 
the new FieldCache and the CSF API implement. That has some advantages:
- Norms can be stored as CSF, and can be accessed using the 
FieldValueSource API. Then we can easily add an option to IndexReader 
to choose whether to cache norms in memory (i.e. the new FieldCache) or 
not. When users have huge indexes on 32-bit machines, where the norms 
would consume too much memory, they can disable caching them; of course 
search performance will suffer (but that's better than OutOfMemoryErrors).
- The function queries can use the FieldValueSource interface to 
retrieve the values (allowing us to get rid of function/ValueSource).
- Any consumer of the FieldValueSource does not have to care about 
whether or not values are cached and how. If performance is too slow and 
memory permits, caching can be enabled very easily.
- We will still support loading the fieldcache from the dictionary for 
backwards compatibility, but we should think about deprecating this and 
eventually getting rid of it. We probably shouldn't add an implementation 
of FieldValueSource that reads from the dictionary, because performance 
would be terrible in the non-cached mode.
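
To illustrate the shape of this proposal, here is a rough sketch. Only the
interface name FieldValueSource comes from the text above; its single
method, the two implementations and the demo class are assumptions, and
nothing here is an existing Lucene API. One implementation stands in for an
uncached source (for example CSF read from disk), the other loads all
values into memory once, and the consumer only sees the interface.

```java
import java.util.function.IntUnaryOperator;

/** Hypothetical interface named after the proposal; the method is an assumption. */
interface FieldValueSource {
    int intValue(int doc);
}

/** Stand-in for an uncached source such as column-stride fields on disk. */
final class UncachedSource implements FieldValueSource {
    private final IntUnaryOperator reader; // simulates one disk read per doc
    int reads = 0;                         // counts simulated disk reads

    UncachedSource(IntUnaryOperator reader) { this.reader = reader; }

    public int intValue(int doc) {
        reads++;
        return reader.applyAsInt(doc);
    }
}

/** In-memory cache that loads every value once from another source. */
final class CachedSource implements FieldValueSource {
    private final int[] values;

    CachedSource(FieldValueSource delegate, int maxDoc) {
        values = new int[maxDoc];
        for (int doc = 0; doc < maxDoc; doc++) {
            values[doc] = delegate.intValue(doc);
        }
    }

    public int intValue(int doc) { return values[doc]; }
}

final class FieldValueSourceDemo {
    /** A consumer that does not care whether the values are cached. */
    static long sum(FieldValueSource source, int maxDoc) {
        long total = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            total += source.intValue(doc);
        }
        return total;
    }
}
```

Switching caching on or off then amounts to choosing which implementation
is handed to the consumer, e.g. `FieldValueSourceDemo.sum(new
CachedSource(disk, maxDoc), maxDoc)` versus passing `disk` directly.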

-Michael


Re: Does LUCENE-831) "Complete overhaul of FieldCache API" provide fieldcache offloading to disk?

Posted by Chris Hostetter <ho...@fucit.org>.
: But then the FieldCache is just starting to feel a lot like column-stride
: fields
: (LUCENE-1231).

that's what i've been thinking ... my goal with LUCENE-831 was to make it 
easier to manage FieldCache and hopefully the norms[] as well, particularly 
in the case of reopen ... but with column-stride fields the need for both 
of those might go away completely

that doesn't mean LUCENE-1231 won't still be useful .. it could probably 
still be leveraged by things like CachingWrapperFilter, and some of the 
Solr caches to reduce the amount of work on reload -- i just don't know 
that FieldCache will really need to exist in the future.


-Hoss


Re: Does LUCENE-831) "Complete overhaul of FieldCache API" provide fieldcache offloading to disk?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Mark Miller wrote:

> I think your 2 readers question is interesting and I will certainly
> think about it. Right now though, each IndexReader instance holds its own
> cache. I'll have to dig back into the code and see about possibly keying
> on the directory or something?

I think, with how IndexReader.reopen() now works, we should switch to
somehow having the FieldCache "attached" to each SegmentReader instead
of stored globally keyed by the top MultiSegmentReader.

This way if we do a reopen and say the only change to the index was 10
added docs then the only new FieldCache that gets created is that
length 10 array (because only that SegmentReader will be new).
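
A minimal sketch of why this helps on reopen, under stated assumptions:
Segment and PerSegmentFieldCache are hypothetical stand-ins rather than
Lucene classes, the segment name is used as the cache key, and copying an
array stands in for the expensive load. Segments that survive a reopen keep
their arrays; only the new segment is loaded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a segment: a name plus the field values of its docs. */
final class Segment {
    final String name;
    final int[] fieldValues;

    Segment(String name, int... fieldValues) {
        this.name = name;
        this.fieldValues = fieldValues;
    }
}

/** Cache keyed per segment instead of per top-level reader. */
final class PerSegmentFieldCache {
    private final Map<String, int[]> bySegment = new HashMap<>();
    int loads = 0; // number of segments actually loaded so far

    /** Returns one array per segment, loading only segments not seen before. */
    List<int[]> arraysFor(List<Segment> segments) {
        List<int[]> result = new ArrayList<>();
        for (Segment s : segments) {
            int[] cached = bySegment.get(s.name);
            if (cached == null) {
                cached = s.fieldValues.clone(); // stands in for the expensive load
                bySegment.put(s.name, cached);
                loads++;
            }
            result.add(cached);
        }
        return result;
    }
}
```

If a reader over segments `_0` and `_1` is reopened after a small segment
`_2` is added, the second call to `arraysFor` performs exactly one load and
returns the same array instances for `_0` and `_1` as before. A real
implementation would also have to drop entries for segments that were
merged away, which this sketch leaves out.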

But then the FieldCache is just starting to feel a lot like
column-stride fields (LUCENE-1231).

Mike

Re: Does LUCENE-831) "Complete overhaul of FieldCache API" provide fieldcache offloading to disk?

Posted by Mark Miller <ma...@gmail.com>.
Yeah, yeah, you are definitely right...if you have field caches larger
than your RAM, you can definitely spill off to disk. I just wonder if
you're going to get performance that is acceptable if you are actually
using all of those fieldcaches and have to go to disk a lot. It would be
awesome to know how that works though...I was very interested in it, but
have not had the time to get together enough data and whatnot for some
good testing. Kind of fell off my priorities...

I think your 2 readers question is interesting and I will certainly
think about it. Right now though, each IndexReader instance holds its own
cache. I'll have to dig back into the code and see about possibly keying
on the directory or something?

Then again, Karl's latest issue may help make 2 readers lose some of its
advantage: https://issues.apache.org/jira/browse/LUCENE-1265 so it may
not be wise to go out of the way to support that use case.

Also, keep in mind that this code may not end up in the results of this
issue at all. I basically just put it out there to demonstrate the kind
of advantage you can get in reopen speed with a large field cache. Hoss
did a great job on the API though, so whoever actually hammers this out
may stick with a lot of it.

Who knows...if you report back with some numbers, maybe you'll influence
how things go <g>.

- Mark

On Thu, 2008-04-17 at 10:25 -0700, Britske wrote:
> The obstacle I'm seeing is that I have a lot of fields which use sorting.
> Sooner or later this will give an OutOfMem-error since the field-cache grows
> too large. Am I correct in assuming that implementing for instance an EHCache
> with flush-to-disk would solve this issue?  (With a tradeoff for performance
> of course)
> 
> Moreover, when warming readers with the patch, thus having 2 readers open at
> the same time (I am using solr searchers btw, but I guess these use the same
> underlying lucene-code, I'll have to check) can these 2 readers share the
> same fieldcache and thus eliminate the required double memory while
> warming? 
> 
> Thanks.
> 
> 
> markrmiller wrote:
> > 
> > It does not specifically incorporate caching to disk, but what it does
> > do is easily allow you to provide a new Cache implementation. The
> > default implementation is just a simple in-memory Map, but it's trivial
> > to provide a new implementation using something like EHCache to back the
> > Cache implementation.
> > 
> > I don't know if caching to disk will really be that much of a benefit,
> > so if you play around I would love to hear your results.
> > 
> > The big benefit is that if you are reopening Readers with field caches,
> > it can be waaay faster.
> > 
> > 
> > - Mark
> > 
> > On Thu, 2008-04-17 at 05:14 -0700, Britske wrote:
> >> I've seen some recent activity on LUCENE-831 "Complete overhaul of
> >> FieldCache API" and read that it must be able to cleanly patch to trunk
> >> (haven't tried yet). 
> >> 
> >> What I'd like to know from people involved is if this patch incorporates
> >> offloading of fieldcache to disk, or if this hasn't yet been taken into
> >> account. As far as I can follow it, this was one of the initial
> >> intentions. 
> >> 
> >> Thanks,
> >> Britske


Re: Does LUCENE-831) "Complete overhaul of FieldCache API" provide fieldcache offloading to disk?

Posted by Britske <gb...@gmail.com>.
The obstacle I'm seeing is that I have a lot of fields which use sorting.
Sooner or later this will give an OutOfMem-error since the field-cache grows
too large. Am I correct in assuming that implementing for instance an EHCache
with flush-to-disk would solve this issue?  (With a tradeoff for performance
of course)

Moreover, when warming readers with the patch, thus having 2 readers open at
the same time (I am using solr searchers btw, but I guess these use the same
underlying lucene-code, I'll have to check) can these 2 readers share the
same fieldcache and thus eliminate the required double memory while
warming? 

Thanks.


markrmiller wrote:
> 
> It does not specifically incorporate caching to disk, but what it does
> do is easily allow you to provide a new Cache implementation. The
> default implementation is just a simple in-memory Map, but it's trivial
> to provide a new implementation using something like EHCache to back the
> Cache implementation.
> 
> I don't know if caching to disk will really be that much of a benefit,
> so if you play around I would love to hear your results.
> 
> The big benefit is that if you are reopening Readers with field caches,
> it can be waaay faster.
> 
> 
> - Mark
> 
> On Thu, 2008-04-17 at 05:14 -0700, Britske wrote:
>> I've seen some recent activity on LUCENE-831 "Complete overhaul of
>> FieldCache API" and read that it must be able to cleanly patch to trunk
>> (haven't tried yet). 
>> 
>> What I'd like to know from people involved is if this patch incorporates
>> offloading of fieldcache to disk, or if this hasn't yet been taken into
>> account. As far as I can follow it, this was one of the initial
>> intentions. 
>> 
>> Thanks,
>> Britske

Re: Does LUCENE-831) "Complete overhaul of FieldCache API" provide fieldcache offloading to disk?

Posted by Mark Miller <ma...@gmail.com>.
It does not specifically incorporate caching to disk, but what it does
do is easily allow you to provide a new Cache implementation. The
default implementation is just a simple in-memory Map, but it's trivial
to provide a new implementation using something like EHCache to back the
Cache implementation.
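
As a rough illustration of what "provide a new Cache implementation" could
look like (a sketch only: the Cache interface below is hypothetical and the
patch's real interface may well differ), here is a plain map-backed cache
next to a bounded one. The bounded version uses the JDK's LinkedHashMap as
a stand-in; an EHCache-backed or disk-spilling version would implement the
same interface and write evicted entries out instead of dropping them.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical cache abstraction; the patch's real interface may differ. */
interface Cache<K, V> {
    V get(K key);
    void put(K key, V value);
}

/** Default-style implementation: a plain in-memory map that never evicts. */
final class MapCache<K, V> implements Cache<K, V> {
    private final Map<K, V> map = new HashMap<>();
    public V get(K key) { return map.get(key); }
    public void put(K key, V value) { map.put(key, value); }
}

/** Bounded alternative: drops the least recently used entry past maxEntries. */
final class LruCache<K, V> implements Cache<K, V> {
    private final Map<K, V> map;

    LruCache(final int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries; // a disk-backed cache would spill here
            }
        };
    }

    public V get(K key) { return map.get(key); }
    public void put(K key, V value) { map.put(key, value); }
}
```

With `new LruCache<String, int[]>(2)`, caching arrays for a third sort field
evicts whichever of the first two was used least recently, so memory stays
bounded at the cost of reloading that field later.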

I don't know if caching to disk will really be that much of a benefit,
so if you play around I would love to hear your results.

The big benefit is that if you are reopening Readers with field caches,
it can be waaay faster.


- Mark

On Thu, 2008-04-17 at 05:14 -0700, Britske wrote:
> I've seen some recent activity on LUCENE-831 "Complete overhaul of FieldCache
> API" and read that it must be able to cleanly patch to trunk (haven't tried
> yet). 
> 
> What I'd like to know from people involved is if this patch incorporates
> offloading of fieldcache to disk, or if this hasn't yet been taken into
> account. As far as I can follow it, this was one of the initial intentions. 
> 
> Thanks,
> Britske

