Posted to dev@lucene.apache.org by Jason Rutherglen <ja...@gmail.com> on 2009/06/10 22:13:55 UTC

Re: Lucene memory usage

Great! If I understand correctly it looks like RAM savings? Will
there be an improvement in lookup speed? (We're using binary
search here?).

Is there a precedent in database systems for what was mentioned
about placing the term dict, delDocs, and filters onto disk and
reading them from there (with the IO cache taking care of
keeping the data in RAM)? (Would there be a future advantage to
this approach when SSDs are more prevalent?) It seems like we
could have some generalized pluggable system where one could try
out this or the current heap approach, and benchmark.

Given our continued inability to properly measure Java RAM
usage, this approach may be a good one for Lucene? Where heap
based LRU caches are a shot in the dark when it comes to mem
size, as we never really know how much they're using.

Once we generalize delDocs, filters, and field caches
(LUCENE-831?), then perhaps CSF is a good place to test out this
approach? We could have a generic class that handles the
underlying IO that simply returns values based on a position or
iteration.

On Wed, Jun 10, 2009 at 11:26 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Roughly, the current approach for the default terms dict codec in
> LUCENE-1458 is:
>
>  * Create a separate class per-field (the String field in each Term
>    is redundant).  This is a big change over Lucene today....
>
>  * That class has String[] indexText and long[] indexPointer, each
>    length = the number of index terms.  No TermInfo instance nor Term
>    instance are used.
>
>  * Modify the tis format to also store its data by field
>
>  * Modify the tis format so that at a seek point (ie an indexed
>    term), absolute values are written for freq/prox pointer, but
>    continue to delta-code in between indexed terms.  EG this is how
>    video codecs work (every so often they write a "key frame" which
>    you can seek to & immediately decode w/ no prior context).
>
>  * tii then just stores text/long (delta coded) for all indexed
>    terms, and is slurped into the arrays on init.
>
> This is a sizable RAM savings over what's done now because you save 2
> objects, 3 pointers, 2 longs, 2 ints (I think), per indexed term.
>
> Mike
>
> On Wed, Jun 10, 2009 at 2:02 PM, Jason
> Rutherglen<ja...@gmail.com> wrote:
> >> LUCENE-1458 (flexible indexing) has these improvements,
> >
> > Mike, can you explain how it's different?  I looked through the code once
> > but yeah, it's in with a lot of other changes.
> >
> > On Wed, Jun 10, 2009 at 5:40 AM, Michael McCandless <
> > lucene@mikemccandless.com> wrote:
> >
> >> This (very large number of unique terms) is a problem for Lucene
> >> currently.
> >>
> >> There are some simple improvements we could make to the terms dict
> >> format to not require so much RAM per term in the terms index...
> >> LUCENE-1458 (flexible indexing) has these improvements, but
> >> unfortunately tied in w/ lots of other changes.  Maybe we should break
> >> out a separate issue for this... this'd be a great contained
> >> improvement, if anyone out there has "the itch" :)
> >>
> >> One simple workaround is to call IndexReader.setTermInfosIndexDivisor
> >> immediately after opening the reader; this simply loads fewer terms in
> >> the index, using far less RAM, but at the expense of somewhat slower
> >> searching.
> >>
> >> Also: you should peek at your index, eg using Luke, to understand why
> >> you have so many terms.  It could be legitimate (indexing a massive
> >> catalog with eg part numbers), or, it could be your document filtering
> >> / analyzer are accidentally producing garbage terms.
> >>
> >> Mike
> >>
> >> On Wed, Jun 10, 2009 at 8:23 AM, Benedikt Boss<na...@web.de> wrote:
> >> > Hej hej,
> >> >
> >> > I have a question regarding Lucene's memory usage
> >> > when launching a query. When I execute my query,
> >> > Lucene eats up over 1 GB of heap memory even
> >> > when my result set is only a single hit. I
> >> > found out that this is due to the "ensureIndexIsRead()"
> >> > method call in the "TermInfosReader" class, which
> >> > iterates over all Terms found in the index and saves
> >> > them (including all value strings) in a Term array.
> >> > Is it possible to not read all that stuff
> >> > into memory at all?
> >> >
> >> > I'm doing the query like in the following pseudo-code:
> >> >
> >> > ------------------------------------------------------------------------
> >> >
> >> > // Assumes indexDir is a java.io.File; field and q are Strings.
> >> > // Factory methods as in Lucene 2.9; adjust for other versions.
> >> > TopScoreDocCollector collector = TopScoreDocCollector.create(100000, true);
> >> >
> >> > QueryParser   parser = new QueryParser(field, new WhitespaceAnalyzer());
> >> > Directory     fsDir  = FSDirectory.open(indexDir);
> >> > IndexSearcher is     = new IndexSearcher(fsDir);
> >> >
> >> > Query         query  = parser.parse(q);
> >> >
> >> > is.search(query, collector);
> >> > // topDocs() returns a TopDocs; the hits are in its scoreDocs field.
> >> > ScoreDoc[] hits = collector.topDocs().scoreDocs;
> >> >
> >> > ....... < iterate over hits and print results >
> >> >
> >> >
> >> > Thanks in advance
> >> > Benedikt
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >
> >> >
> >>
> >
>

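The seek-point scheme Mike describes in the quoted message above
(absolute pointers written at each indexed term, delta coding in
between, like key frames in a video codec) can be sketched roughly as
follows. This is only an illustration: the class and method names are
hypothetical, the "interval" parameter stands in for the term index
interval, and none of this is the actual tis/tii code from LUCENE-1458.

```java
import java.util.Arrays;

/** Hypothetical sketch of "key frame" delta coding; not Lucene's file format code. */
final class SeekPointDeltaSketch {

    /** Writes an absolute value at every interval-th entry and a delta everywhere else. */
    static long[] encode(long[] pointers, int interval) {
        long[] out = new long[pointers.length];
        for (int i = 0; i < pointers.length; i++) {
            out[i] = (i % interval == 0) ? pointers[i] : pointers[i] - pointers[i - 1];
        }
        return out;
    }

    /** Decodes one entry by starting at the nearest preceding seek point. */
    static long decodeAt(long[] encoded, int interval, int target) {
        int seek = (target / interval) * interval;
        long value = encoded[seek]; // absolute value: no prior context needed
        for (int i = seek + 1; i <= target; i++) {
            value += encoded[i];    // apply the deltas after the seek point
        }
        return value;
    }

    public static void main(String[] args) {
        long[] pointers = {100, 130, 175, 260, 300, 420, 421, 500};
        long[] encoded = encode(pointers, 4);
        System.out.println(Arrays.toString(encoded));
        System.out.println(decodeAt(encoded, 4, 6));
    }
}
```

Decoding any entry touches at most interval - 1 deltas, which is why a
reader can jump straight to an indexed term without replaying the file
from the start.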
Re: Lucene memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Jun 11, 2009 at 4:30 PM, Jason
Rutherglen<ja...@gmail.com> wrote:
>> Yes please post feature requests to Sun ;)
>
> I signed up for
> http://mail.openjdk.java.net/mailman/listinfo/nio-discuss

Looks like a fun list ;)

>> But I think in the short term Lucene will have to drop to
>> native code to tell OS not to cache bytes read by segment
>> merging...
>
> LUCENE-1121 uses transferTo which presumably doesn't run bytes
> through the IO cache? Granted it's slower on most platforms, but
> could this be fixed in future Java releases?

I fear transferTo may in fact burn through the IO cache... or, at
least, when I tested it in the past, on multi-GB files, I saw vmem
used by the process grow enormously... I think OS was choosing to use
massive chunks of RAM as the intermediate buffer (slurping lots of
bytes in, and then writing lots of bytes out), to minimize seeking.
We got mixed results on the performance gains on that issue, and some
(Windows Server 2003) were hideously slower.  I think the OS may have
been evicting dirty pages, or swapping app core out, or something, in
order to free up RAM.
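
For readers unfamiliar with the call under discussion, here is a
minimal sketch of a file copy built on FileChannel.transferTo. The
class and method names are made up for illustration and this is not the
LUCENE-1121 patch. Note that nothing in this API lets the caller say
whether the OS should cache the bytes, which is the limitation being
discussed.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

/** Hypothetical sketch of a transferTo-based copy; not the LUCENE-1121 code. */
final class TransferToCopySketch {

    /** Copies src to dst and returns the number of bytes transferred. */
    static long copy(File src, File dst) throws IOException {
        try (FileChannel in = new FileInputStream(src).getChannel();
             FileChannel out = new FileOutputStream(dst).getChannel()) {
            long size = in.size();
            long pos = 0;
            while (pos < size) {
                // transferTo may move fewer bytes than requested, so loop.
                pos += in.transferTo(pos, size - pos, out);
            }
            return pos;
        }
    }
}
```

How the bytes actually travel (through the page cache, through a large
intermediate buffer, or directly) is decided by the JVM and the OS, so
behavior can differ per platform, as the mixed results above suggest.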

We really need better control on what the OS does w/ IO and our RAM,
but we lack that from Java.  EG if I could, I'd like to give Lucene
users the option to pin certain data structures (like the terms index)
to ensure the OS never swaps them out.  But we just can't do that from
Java...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene memory usage

Posted by Jason Rutherglen <ja...@gmail.com>.
> Yes please post feature requests to Sun ;)

I signed up for
http://mail.openjdk.java.net/mailman/listinfo/nio-discuss

> But I think in the short term Lucene will have to drop to
> native code to tell OS not to cache bytes read by segment
> merging...

LUCENE-1121 uses transferTo which presumably doesn't run bytes
through the IO cache? Granted it's slower on most platforms, but
could this be fixed in future Java releases?

On Thu, Jun 11, 2009 at 12:50 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Thu, Jun 11, 2009 at 3:21 PM, Jason
> Rutherglen<ja...@gmail.com> wrote:
> > Makes sense.
> >
> > Currently MMapDirectory doesn't write using mapped byte buffers,
> > would the memory management of the OS behave differently if we
> > were writing to the MMapped bytebuffers as opposed to writing to
> > an RAF (like with FSDir)?
>
> I would assume not, but would be good to confirm (Earwin, where's your
> improved MMapDir?).
>
> At the page level it's all basic LRU, and LRU is not a good policy for
> deciding which search data structures are best kept RAM resident.
>
> >> Well... locality is still important. Under the hood, mmap on a
> >> page miss must hit the disk.
> >
> > Maybe this is where MappedByteBuffer.load as Earwin has
> > mentioned comes in handy?
>
> That's unfortunately a rather blunt tool, and only practical when
> available RAM exceeds the index size.  Warming your particular
> searches is more precise...
>
> Though..... RAM prices are so cheap these days that any "real"
> production search deployment should always aim to have the full index
> hot, in RAM.  Lucene really should provide a good impl for "RAM only"
> indexes.  RAMDirectory/MMapDir are not the answer, since Lucene is
> still using postings formats designed for single scan through a file
> that resides on disk where bytes consumed on disk are minimized (eg
> the VInt format is not CPU friendly).  We should start from
> contrib/instantiated and contrib/memory and iterate from there...
>
> > But yeah, we can't do anything with this unless we had a JNI
> > library that interacts more directly with the IO system
> > (allowing us to configure whether IO is cached etc), which
> > perhaps exists or could exist in the future (or Java7?).
>
> Yes please post feature requests to Sun ;)
>
> But I think in the short term Lucene will have to drop to native code
> to tell OS not to cache bytes read by segment merging...
>
> Mike
>

Re: Lucene memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Jun 11, 2009 at 3:21 PM, Jason
Rutherglen<ja...@gmail.com> wrote:
> Makes sense.
>
> Currently MMapDirectory doesn't write using mapped byte buffers,
> would the memory management of the OS behave differently if we
> were writing to the MMapped bytebuffers as opposed to writing to
> an RAF (like with FSDir)?

I would assume not, but would be good to confirm (Earwin, where's your
improved MMapDir?).

At the page level it's all basic LRU, and LRU is not a good policy for
deciding which search data structures are best kept RAM resident.

>> Well... locality is still important. Under the hood, mmap on a
>> page miss must hit the disk.
>
> Maybe this is where MappedByteBuffer.load as Earwin has
> mentioned comes in handy?

That's unfortunately a rather blunt tool, and only practical when
available RAM exceeds the index size.  Warming your particular
searches is more precise...
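
As a rough illustration of the "blunt tool" in question, the sketch
below maps a whole file read-only and calls MappedByteBuffer.load() on
it. The class and method names are hypothetical and this is not
MMapDirectory code. load() is only a best-effort request: the JDK does
not guarantee the pages stay resident afterwards, and a single mapping
is limited to 2 GB, so a large index file would need several mappings.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical sketch of eager mmap warming; not MMapDirectory code. */
final class MmapLoadSketch {

    /** Maps the whole file read-only and asks the OS to page it in. */
    static MappedByteBuffer mapAndLoad(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel ch = raf.getChannel()) {
            // The mapping stays valid after the channel is closed.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.load(); // best effort only; pages may still be evicted later
            return buf;
        }
    }
}
```

Because it touches every page of the file, this only pays off when the
whole mapped region fits in available RAM, which is the caveat raised
above.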

Though..... RAM prices are so cheap these days that any "real"
production search deployment should always aim to have the full index
hot, in RAM.  Lucene really should provide a good impl for "RAM only"
indexes.  RAMDirectory/MMapDir are not the answer, since Lucene is
still using postings formats designed for single scan through a file
that resides on disk where bytes consumed on disk are minimized (eg
the VInt format is not CPU friendly).  We should start from
contrib/instantiated and contrib/memory and iterate from there...

> But yeah, we can't do anything with this unless we had a JNI
> library that interacts more directly with the IO system
> (allowing us to configure whether IO is cached etc), which
> perhaps exists or could exist in the future (or Java7?).

Yes please post feature requests to Sun ;)

But I think in the short term Lucene will have to drop to native code
to tell OS not to cache bytes read by segment merging...

Mike



Re: Lucene memory usage

Posted by Jason Rutherglen <ja...@gmail.com>.
Maybe we can put together our requested IO operations and submit them for
inclusion in NIO Java 7?  http://openjdk.java.net/projects/nio/

On Thu, Jun 11, 2009 at 12:21 PM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> Makes sense.
>
> Currently MMapDirectory doesn't write using mapped byte buffers,
> would the memory management of the OS behave differently if we
> were writing to the MMapped bytebuffers as opposed to writing to
> an RAF (like with FSDir)?
>
> > its blind LRU approach is often a poor policy (eg for terms
> > dict, where a binary search could easily suddenly need to visit
> > a random rarely accessed page).
>
> Agreed it's not the best for termDict.
>
> > Well... locality is still important. Under the hood, mmap on a
> > page miss must hit the disk.
>
> Maybe this is where MappedByteBuffer.load as Earwin has
> mentioned comes in handy?
>
> But yeah, we can't do anything with this unless we had a JNI
> library that interacts more directly with the IO system
> (allowing us to configure whether IO is cached etc), which
> perhaps exists or could exist in the future (or Java7?).
>
>
>
> On Thu, Jun 11, 2009 at 2:43 AM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> On Wed, Jun 10, 2009 at 9:24 PM, Jason
>> Rutherglen<ja...@gmail.com> wrote:
>> > I read over the LUCENE-1458 comments again. Interesting. I think
>> > the most compelling argument is that the various files we're
>> > normally loading into the heap are, after merging, in the IO
>> > cache. If we can simply reuse the IO cache rather than allocate
>> > a bunch of redundant arrays in heap, we could be better off? I
>> > think this is very compelling for field caches, delDocs, and
>> > bitsets that are tied to segments and loaded after each merge.
>>
>> The OS doesn't have enough information to "know" what data structures
>> are important to Lucene (must stay hot) and which are less so.  Its
>> blind LRU approach is often a poor policy (eg for terms dict, where a
>> binary search could easily suddenly need to visit a random rarely
>> accessed page).
>>
>> For example, after merging, all the segments we just *read* from will
>> also be hot, having flushed out other important pages from the IO
>> cache, which is very much not what we want to do.  From C, and per-OS,
>> you can inform the OS that it should not cache the bytes read from the
>> file, but from Java we just can't control that.
>>
>> > I think it's possible to write some basic benchmarks to test a
>> > byte[] BitVector vs. a MappedByteBuffer BitVector and see what
>> > happens.
>>
>> Yes, but this is challenging to test properly.  On systems with plenty
>> of RAM, the approaches should be similarly fast.  On systems starved
>> for RAM, both approaches should thrash miserably.  It's the cases in
>> between that we need to test for.
>>
>> > The other potentially interesting angle here is in regards to
>> > realtime updates, where we can implement a MMaped page type of
>> > system so blocks of this stuff can be updated in near realtime,
>> > directly in the MMaped space (similar to how in heap land with
>> > LUCENE-1526 we're looking at breaking up the byte[] into a
>> > byte[][]).
>>
>> But carrying such updates via RAM, like we do now for deletions,
>> should generally be more performant (you never have to put the changes
>> on disk).
>>
>> > Also if we assume data is MMaped I don't think it matters as much if
>> > the updates on disk are not in sequence? (Whereas today we try
>> > to keep all our files sequentially readable optimized). Of
>> > course I could be completely wrong. :)
>>
>> Well... locality is still important.  Under the hood, mmap on a page
>> miss must hit the disk.
>>
>> Mike
>>
>

Re: Lucene memory usage

Posted by Jason Rutherglen <ja...@gmail.com>.
Makes sense.

Currently MMapDirectory doesn't write using mapped byte buffers,
would the memory management of the OS behave differently if we
were writing to the MMapped bytebuffers as opposed to writing to
an RAF (like with FSDir)?

> its blind LRU approach is often a poor policy (eg for terms
> dict, where a binary search could easily suddenly need to visit
> a random rarely accessed page).

Agreed it's not the best for termDict.

> Well... locality is still important. Under the hood, mmap on a
> page miss must hit the disk.

Maybe this is where MappedByteBuffer.load as Earwin has
mentioned comes in handy?

But yeah, we can't do anything with this unless we had a JNI
library that interacts more directly with the IO system
(allowing us to configure whether IO is cached etc), which
perhaps exists or could exist in the future (or Java7?).


On Thu, Jun 11, 2009 at 2:43 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Wed, Jun 10, 2009 at 9:24 PM, Jason
> Rutherglen<ja...@gmail.com> wrote:
> > I read over the LUCENE-1458 comments again. Interesting. I think
> > the most compelling argument is that the various files we're
> > normally loading into the heap are, after merging, in the IO
> > cache. If we can simply reuse the IO cache rather than allocate
> > a bunch of redundant arrays in heap, we could be better off? I
> > think this is very compelling for field caches, delDocs, and
> > bitsets that are tied to segments and loaded after each merge.
>
> The OS doesn't have enough information to "know" what data structures
> are important to Lucene (must stay hot) and which are less so.  Its
> blind LRU approach is often a poor policy (eg for terms dict, where a
> binary search could easily suddenly need to visit a random rarely
> accessed page).
>
> For example, after merging, all the segments we just *read* from will
> also be hot, having flushed out other important pages from the IO
> cache, which is very much not what we want to do.  From C, and per-OS,
> you can inform the OS that it should not cache the bytes read from the
> file, but from Java we just can't control that.
>
> > I think it's possible to write some basic benchmarks to test a
> > byte[] BitVector vs. a MappedByteBuffer BitVector and see what
> > happens.
>
> Yes, but this is challenging to test properly.  On systems with plenty
> of RAM, the approaches should be similarly fast.  On systems starved
> for RAM, both approaches should thrash miserably.  It's the cases in
> between that we need to test for.
>
> > The other potentially interesting angle here is in regards to
> > realtime updates, where we can implement a MMaped page type of
> > system so blocks of this stuff can be updated in near realtime,
> > directly in the MMaped space (similar to how in heap land with
> > LUCENE-1526 we're looking at breaking up the byte[] into a
> > byte[][]).
>
> But carrying such updates via RAM, like we do now for deletions,
> should generally be more performant (you never have to put the changes
> on disk).
>
> > Also if we assume data is MMaped I don't think it matters as much if
> > the updates on disk are not in sequence? (Whereas today we try
> > to keep all our files sequentially readable optimized). Of
> > course I could be completely wrong. :)
>
> Well... locality is still important.  Under the hood, mmap on a page
> miss must hit the disk.
>
> Mike
>

Re: Lucene memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Jun 10, 2009 at 9:24 PM, Jason
Rutherglen<ja...@gmail.com> wrote:
> I read over the LUCENE-1458 comments again. Interesting. I think
> the most compelling argument is that the various files we're
> normally loading into the heap are, after merging, in the IO
> cache. If we can simply reuse the IO cache rather than allocate
> a bunch of redundant arrays in heap, we could be better off? I
> think this is very compelling for field caches, delDocs, and
> bitsets that are tied to segments and loaded after each merge.

The OS doesn't have enough information to "know" what data structures
are important to Lucene (must stay hot) and which are less so.  Its
blind LRU approach is often a poor policy (eg for terms dict, where a
binary search could easily suddenly need to visit a random rarely
accessed page).

For example, after merging, all the segments we just *read* from will
also be hot, having flushed out other important pages from the IO
cache, which is very much not what we want to do.  From C, and per-OS,
you can inform the OS that it should not cache the bytes read from the
file, but from Java we just can't control that.

> I think it's possible to write some basic benchmarks to test a
> byte[] BitVector vs. a MappedByteBuffer BitVector and see what
> happens.

Yes, but this is challenging to test properly.  On systems with plenty
of RAM, the approaches should be similarly fast.  On systems starved
for RAM, both approaches should thrash miserably.  It's the cases in
between that we need to test for.
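
To make the comparison concrete, the two variants being proposed for
benchmarking could look roughly like the sketch below. All names are
hypothetical and this is not Lucene's BitVector. The buffer-backed
variant accepts any ByteBuffer, so a MappedByteBuffer obtained from
FileChannel.map can be passed in directly; the bit layout (bit i lives
in byte i >> 3) is an assumption made for the sketch.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of heap vs. buffer-backed bit sets; not Lucene's BitVector. */
final class BitVectorSketch {

    interface Bits {
        boolean get(int index);
    }

    /** Bits held in a Java heap array. */
    static final class HeapBits implements Bits {
        private final byte[] bytes;

        HeapBits(byte[] bytes) {
            this.bytes = bytes;
        }

        public boolean get(int i) {
            return (bytes[i >> 3] & (1 << (i & 7))) != 0;
        }
    }

    /** Bits read through a ByteBuffer, e.g. a MappedByteBuffer over a file. */
    static final class BufferBits implements Bits {
        private final ByteBuffer buf;

        BufferBits(ByteBuffer buf) {
            this.buf = buf;
        }

        public boolean get(int i) {
            // Absolute get: does not move the buffer's position.
            return (buf.get(i >> 3) & (1 << (i & 7))) != 0;
        }
    }

    /** Counts set bits in [0, size); a trivial workload a benchmark could time. */
    static int countSet(Bits bits, int size) {
        int count = 0;
        for (int i = 0; i < size; i++) {
            if (bits.get(i)) {
                count++;
            }
        }
        return count;
    }
}
```

A fair benchmark would run countSet (or a random-access variant) against
both implementations while varying how much free RAM the machine has,
since the in-between cases are the interesting ones.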

> The other potentially interesting angle here is in regards to
> realtime updates, where we can implement a MMaped page type of
> system so blocks of this stuff can be updated in near realtime,
> directly in the MMaped space (similar to how in heap land with
> LUCENE-1526 we're looking at breaking up the byte[] into a
> byte[][]).

But carrying such updates via RAM, like we do now for deletions,
should generally be more performant (you never have to put the changes
on disk).

> Also if we assume data is MMaped I don't think it matters as much if
> the updates on disk are not in sequence? (Whereas today we try
> to keep all our files sequentially readable optimized). Of
> course I could be completely wrong. :)

Well... locality is still important.  Under the hood, mmap on a page
miss must hit the disk.

Mike



Re: Lucene memory usage

Posted by Jason Rutherglen <ja...@gmail.com>.
I read over the LUCENE-1458 comments again. Interesting. I think
the most compelling argument is that the various files we're
normally loading into the heap are, after merging, in the IO
cache. If we can simply reuse the IO cache rather than allocate
a bunch of redundant arrays in heap, we could be better off? I
think this is very compelling for field caches, delDocs, and
bitsets that are tied to segments and loaded after each merge.

I think it's possible to write some basic benchmarks to test a
byte[] BitVector vs. a MappedByteBuffer BitVector and see what
happens.

The other potentially interesting angle here is in regards to
realtime updates, where we can implement a MMaped page type of
system so blocks of this stuff can be updated in near realtime,
directly in the MMaped space (similar to how in heap land with
LUCENE-1526 we're looking at breaking up the byte[] into a
byte[][]).

Also if we assume data is MMaped I don't think it matters as much if
the updates on disk are not in sequence? (Whereas today we try
to keep all our files sequentially readable optimized). Of
course I could be completely wrong. :)

On Wed, Jun 10, 2009 at 5:19 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Wed, Jun 10, 2009 at 7:23 PM, Jason
> Rutherglen<ja...@gmail.com> wrote:
> > Cool! Sounds like with LUCENE-1458 we can experiment with some
> > of these things. Does CSF become just another codec?
>
> I believe LUCENE-1458 currently only makes terms dict & postings
> pluggable...
>
> >> I'm leery of having terms dict live entirely on disk, though
> >> we should certainly explore it.
> >
> > Yeah, it should theoretically help with reloading; it could use
> > a skiplist (as we have a disk version of that implemented)
> > instead of binary search. It seems like with things like
> > TrieRange (which potentially adds many fields and terms) it
> > could be useful to let the IO cache calculate what we need in
> > RAM and what we don't, otherwise we're constantly at risk of
> > exceeding heap usage. There'll be other potential RAM issues
> > (such as page faults), but it seems like users will constantly
> > be up against the inability to precalculate Java heap usage of
> > data structures (whereas file based data usage can be measured).
> > Norms are another example, and with flexible indexing (and
> > scoring?) there may be additional fields the user may want to
> > change dynamically, that if completely loaded into heap cause
> > OOM problems.
> >
> > I guess I personally think it would be great to not worry about
> > exceeding heap with Lucene apps (as it's a guessing game), and
> > then one can simply analyze the OS level IO cache/swap space to
> > see if the app could slow down due to the machine not having
> > enough RAM. I think this would remove one of the major
> > differences between a Java based search engine and a C++ based
> > one.
>
> Marvin and I discussed this quite a bit already in LUCENE-1458... we
> should make it pluggable and then try both -- let the machine tell
> us ;)
>
> Mike
>

Re: Lucene memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Jun 10, 2009 at 7:23 PM, Jason
Rutherglen<ja...@gmail.com> wrote:
> Cool! Sounds like with LUCENE-1458 we can experiment with some
> of these things. Does CSF become just another codec?

I believe LUCENE-1458 currently only makes terms dict & postings
pluggable...

>> I'm leery of having terms dict live entirely on disk, though
>> we should certainly explore it.
>
> Yeah, it should theoretically help with reloading; it could use
> a skiplist (as we have a disk version of that implemented)
> instead of binary search. It seems like with things like
> TrieRange (which potentially adds many fields and terms) it
> could be useful to let the IO cache calculate what we need in
> RAM and what we don't, otherwise we're constantly at risk of
> exceeding heap usage. There'll be other potential RAM issues
> (such as page faults), but it seems like users will constantly
> be up against the inability to precalculate Java heap usage of
> data structures (whereas file based data usage can be measured).
> Norms are another example, and with flexible indexing (and
> scoring?) there may be additional fields the user may want to
> change dynamically, that if completely loaded into heap cause
> OOM problems.
>
> I guess I personally think it would be great to not worry about
> exceeding heap with Lucene apps (as it's a guessing game), and
> then one can simply analyze the OS level IO cache/swap space to
> see if the app could slow down due to the machine not having
> enough RAM. I think this would remove one of the major
> differences between a Java based search engine and a C++ based
> one.

Marvin and I discussed this quite a bit already in LUCENE-1458... we
should make it pluggable and then try both -- let the machine tell
us ;)

Mike



Re: Lucene memory usage

Posted by Jason Rutherglen <ja...@gmail.com>.
Cool! Sounds like with LUCENE-1458 we can experiment with some
of these things. Does CSF become just another codec?

> I'm leery of having terms dict live entirely on disk, though
> we should certainly explore it.

Yeah, it should theoretically help with reloading; it could use
a skiplist (as we have a disk version of that implemented)
instead of binary search. It seems like with things like
TrieRange (which potentially adds many fields and terms) it
could be useful to let the IO cache calculate what we need in
RAM and what we don't, otherwise we're constantly at risk of
exceeding heap usage. There'll be other potential RAM issues
(such as page faults), but it seems like users will constantly
be up against the inability to precalculate Java heap usage of
data structures (whereas file based data usage can be measured).
Norms are another example, and with flexible indexing (and
scoring?) there may be additional fields the user may want to
change dynamically, that if completely loaded into heap cause
OOM problems.

I guess I personally think it would be great to not worry about
exceeding heap with Lucene apps (as it's a guessing game), and
then one can simply analyze the OS level IO cache/swap space to
see if the app could slow down due to the machine not having
enough RAM. I think this would remove one of the major
differences between a Java based search engine and a C++ based
one.

On Wed, Jun 10, 2009 at 1:26 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Wed, Jun 10, 2009 at 4:13 PM, Jason
> Rutherglen<ja...@gmail.com> wrote:
> > Great! If I understand correctly it looks like RAM savings? Will
> > there be an improvement in lookup speed? (We're using binary
> > search here?).
>
> Yes, sizable RAM reduction for apps that have many unique terms.  And,
> init'ing (warming) the reader should be faster.
>
> Lookup speed should be faster (binary search against the terms in a
> single field, not all terms).
>
> > Is there a precedent in database systems for what was mentioned
> > about placing the term dict, delDocs, and filters onto disk and
> > reading them from there (with the IO cache taking care of
> > keeping the data in RAM)? (Would there be a future advantage to
> > this approach when SSDs are more prevalent?) It seems like we
> > could have some generalized pluggable system where one could try
> > out this or the current heap approach, and benchmark.
>
> LUCENE-1458 creates exactly such a pluggable system.  Ie it lets you
> swap in your own codec for terms, freq, prox, etc.
>
> But: I'm leery of having terms dict live entirely on disk, though we
> should certainly explore it.
>
> > Given our continued inability to properly measure Java RAM
> > usage, this approach may be a good one for Lucene? Where heap
> > based LRU caches are a shot in the dark when it comes to mem
> > size, as we never really know how much they're using.
>
> Well remember mmap uses an LRU policy to decide when pages are swapped
> to disk... so a search that's unlucky can easily hit many page faults
> just in consulting the terms dict.  You could be at 200 msec cost
> before you even hit a postings list... I prefer to have the terms
> index RAM resident (of course the OS can still swap THAT out too...).
>
> > Once we generalize delDocs, filters, and field caches
> > (LUCENE-831?), then perhaps CSF is a good place to test out this
> > approach? We could have a generic class that handles the
> > underlying IO that simply returns values based on a position or
> > iteration.
>
> I agree, a CSF codec that uses mmap seems like a good place to
> start...
>
> Mike
>

Re: Lucene memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Jun 10, 2009 at 4:13 PM, Jason
Rutherglen<ja...@gmail.com> wrote:
> Great! If I understand correctly it looks like RAM savings? Will
> there be an improvement in lookup speed? (We're using binary
> search here?).

Yes, sizable RAM reduction for apps that have many unique terms.  And,
init'ing (warming) the reader should be faster.

Lookup speed should be faster (binary search against the terms in a
single field, not all terms).
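
A rough sketch of what a per-field lookup over the two parallel arrays
described earlier in the thread (String[] indexText and long[]
indexPointer) might look like. The class name and the seekPointer
method are hypothetical, and java.lang.String's natural ordering is
assumed as the term order; this is not the LUCENE-1458 code.

```java
import java.util.Arrays;

/** Hypothetical sketch of a per-field terms index; not the LUCENE-1458 code. */
final class FieldTermsIndexSketch {
    private final String[] indexText;  // sorted index terms of ONE field
    private final long[] indexPointer; // terms-dict file pointer per index term

    FieldTermsIndexSketch(String[] indexText, long[] indexPointer) {
        this.indexText = indexText;
        this.indexPointer = indexPointer;
    }

    /**
     * Returns the file pointer of the greatest index term that is <= term,
     * or -1 if term sorts before every index term of this field.
     */
    long seekPointer(String term) {
        int pos = Arrays.binarySearch(indexText, term);
        if (pos < 0) {
            pos = -pos - 2; // (insertion point) - 1, i.e. the preceding index term
        }
        return pos < 0 ? -1 : indexPointer[pos];
    }
}
```

Because each instance holds only one field's terms, the search compares
plain strings and covers a smaller array than a single index spanning
every field.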

> Is there a precedent in database systems for what was mentioned
> about placing the term dict, delDocs, and filters onto disk and
> reading them from there (with the IO cache taking care of
> keeping the data in RAM)? (Would there be a future advantage to
> this approach when SSDs are more prevalent?) It seems like we
> could have some generalized pluggable system where one could try
> out this or the current heap approach, and benchmark.

LUCENE-1458 creates exactly such a pluggable system.  Ie it lets you
swap in your own codec for terms, freq, prox, etc.

But: I'm leery of having terms dict live entirely on disk, though we
should certainly explore it.

> Given our continued inability to properly measure Java RAM
> usage, this approach may be a good one for Lucene? Where heap
> based LRU caches are a shot in the dark when it comes to mem
> size, as we never really know how much they're using.

Well remember mmap uses an LRU policy to decide when pages are swapped
to disk... so a search that's unlucky can easily hit many page faults
just in consulting the terms dict.  You could be at 200 msec cost
before you even hit a postings list... I prefer to have the terms
index RAM resident (of course the OS can still swap THAT out too...).

> Once we generalize delDocs, filters, and field caches
> (LUCENE-831?), then perhaps CSF is a good place to test out this
> approach? We could have a generic class that handles the
> underlying IO that simply returns values based on a position or
> iteration.

I agree, a CSF codec that uses mmap seems like a good place to
start...
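
One possible shape for the "generic class that returns values based on
a position" is sketched below, assuming a file that stores one
fixed-width big-endian int per document. The class name, the file
layout, and the open method are all assumptions made for illustration;
no CSF format is defined in this thread.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical sketch of an mmap-backed per-document int column; not a real CSF codec. */
final class IntColumnSketch {
    private final ByteBuffer buf;

    IntColumnSketch(ByteBuffer buf) {
        this.buf = buf;
    }

    /** Maps the file read-only; values are left for the OS to page in on demand. */
    static IntColumnSketch open(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel ch = raf.getChannel()) {
            return new IntColumnSketch(ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
        }
    }

    /** Number of values stored in the column. */
    int size() {
        return buf.limit() / Integer.BYTES;
    }

    /** Returns the value for docId using an absolute read (no shared position). */
    int get(int docId) {
        return buf.getInt(docId * Integer.BYTES);
    }
}
```

Nothing here is held on the Java heap beyond the buffer object itself,
so memory use shows up in the OS page cache rather than in heap
accounting, which is the trade-off being weighed in this thread.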

Mike
