Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2010/02/01 20:52:45 UTC

Index-time RAM consumption settings (was Invalid UTF-8)

On Wed, Jan 27, 2010 at 10:43:22PM -0600, Peter Karman wrote:

> Is there any plan to make the DEFAULT_MEM_THRESH alterable at runtime?

I've made it settable privately so that we could go back to simulating large
indexes within the test suite. But as a public API?  

Well, here's the problem.  It's an implementation detail, specific to
PostingListWriter.  I'm just about to add another, separate SortExternal pool
in SortWriter, which will have its own threshold at which it flushes runs to
disk.  More generally, arbitrary index components added using custom
Architectures might have their own pools and their own thresholds.  How would
setting a default memory threshold for one affect the others?

I don't think it makes sense to expose any of those thresholds specifically.
Lucene has historically exposed all kinds of extra optimization settings via
IndexWriter, which go stale as the underlying implementation changes, bloating
IndexWriter's API and causing confusion:

  setMergeFactor()
  setMaxMergeDocs() 
  setMaxBufferedDocs() 
  setMergePolicy() 
  setMergeScheduler() 
  setRAMBufferSizeMB()
  
And so on.  I think that's sub-optimal design for a number of reasons, and I
think it's important that Lucy *not* go down the same road.

> I'm assuming that in situations where available ram is low, it would be
> helpful to trade-off speed for memory by setting the threshold lower and
> flushing to disk more often. Is that a realistic assumption?

If we were to do something like that, it would be one dial, and instead of
Indexer it would go into IndexManager, where we hide all expert per-session
settings.  Rather than an absolute number, it would be a float multiplier
defaulting to 1.0 which all index components would have the option of
consulting.  PostingListWriter would use it to scale its memory threshold.
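
To make the "one dial" idea concrete, here is a minimal sketch in Python. It is
not Lucy code: the mem_scale attribute, the BASE_MEM_THRESH values, and the
add()/flush() methods are all hypothetical names and numbers chosen for
illustration. It only shows how a single float multiplier held by IndexManager
could be consulted by components that each keep their own pool and threshold.

```python
from dataclasses import dataclass


@dataclass
class IndexManager:
    """Hypothetical holder for expert per-session settings."""
    mem_scale: float = 1.0  # the single dial; 1.0 means "use defaults"


class PooledWriter:
    """Hypothetical base for any component that buffers data in RAM."""
    BASE_MEM_THRESH = 0  # bytes; each subclass picks its own default

    def __init__(self, manager: IndexManager):
        # Consulting the multiplier is optional; this component opts in.
        self.mem_thresh = int(self.BASE_MEM_THRESH * manager.mem_scale)
        self.pool_bytes = 0
        self.flush_count = 0

    def add(self, nbytes: int) -> None:
        self.pool_bytes += nbytes
        if self.pool_bytes >= self.mem_thresh:
            self.flush()

    def flush(self) -> None:
        # Stand-in for writing a sorted run to disk.
        self.pool_bytes = 0
        self.flush_count += 1


class PostingListWriter(PooledWriter):
    BASE_MEM_THRESH = 16 * 1024 * 1024  # illustrative value only


class SortWriter(PooledWriter):
    BASE_MEM_THRESH = 4 * 1024 * 1024  # illustrative value only


if __name__ == "__main__":
    manager = IndexManager(mem_scale=0.5)  # flush twice as often
    for writer in (PostingListWriter(manager), SortWriter(manager)):
        print(type(writer).__name__, writer.mem_thresh)
```

Each component scales its own default, so the dial says nothing about any
particular component's internals.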

However, it would not cap memory usage.  It wouldn't be like specifying a JVM
heap size.  And performance will still depend to a large extent on the size of
the index and the RAM installed in the machine, since speed will dive if our
temp files get ejected from the IO cache.

FWIW, once we fix SortWriter's RAM consumption problem, we'll go back to being
relatively parsimonious with process RAM.

Marvin Humphrey


Re: [Lucy] Index-time RAM consumption settings (was Invalid UTF-8)

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 2/1/10 1:52 PM:

> FWIW, once we fix SortWriter's RAM consumption problem, we'll go back to being
> relatively parsimonious with process RAM.

If that proves true, I think it's a non-issue.

I only raise the flag because in Xapian there's such a dial, an env var setting
a flush threshold, that is often mentioned as a way to control indexing speed
vs. memory use. From what I've seen of KS the indexing speed is much faster and
mem use much lower anyway, so I'm not worrying about it.

Thanks for the detailed reply re: the issues involved.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com