Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2008/11/16 00:38:34 UTC

Optimizing InStream for mmap

Greets,

In commits r3895 - r3925 to the KinoSearch repository, InStream has been
optimized for internal use of mmap() on Unixen.  

  * On 32-bit Unixen, InStream provides access to the file data via a
    variable-width "sliding window". The window is opened and closed using
    repeated calls to mmap() and munmap().
  * On systems without sys/mman.h (e.g. Windows), we fall back to using
    a malloc'd buffer and sequential reads to fake up a sliding window.
  * On 64-bit Unixen, mmap() only gets called once, at object creation time.
    There's no need for a sliding window.
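
The 32-bit case in the list above can be sketched roughly as follows. This is
only an illustration, not the KinoSearch code: the Window struct and
Window_slide() are hypothetical names. The one hard constraint it demonstrates
is that the file offset passed to mmap() must be a multiple of the page size,
so the mapping starts a little before the bytes the caller asked for.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical window type; not the actual InStream layout. */
typedef struct {
    char   *map;      /* base address returned by mmap(), page aligned */
    size_t  map_len;  /* length that was passed to mmap() */
    char   *buf;      /* first byte the caller asked for */
    size_t  len;      /* number of bytes the caller asked for */
} Window;

/* Move the window so that it covers [offset, offset + len) of fd.
 * Any previous mapping is released first.  Returns 0 on success, -1 on
 * failure. */
static int
Window_slide(Window *win, int fd, off_t offset, size_t len)
{
    long   page  = sysconf(_SC_PAGESIZE);
    off_t  start = offset - (offset % (off_t)page); /* round down to a page */
    size_t pad   = (size_t)(offset - start);
    void  *map;

    if (win->map != NULL) {
        munmap(win->map, win->map_len);
        win->map = NULL;
    }

    map = mmap(NULL, len + pad, PROT_READ, MAP_SHARED, fd, start);
    if (map == MAP_FAILED) { return -1; }

    win->map     = (char*)map;
    win->map_len = len + pad;
    win->buf     = win->map + pad;
    win->len     = len;
    return 0;
}
```

On a 64-bit system the same function would be called once with offset 0 and
the full file length, and the window would never need to move.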

For optimum performance under 64-bit Unixen, client code can request a
window the width of the entire file:

  Foo*
  Foo_new(InStream *instream)
  {
    Foo   *self    = (Foo*)CREATE(NULL, FOO);
    i64_t  len     = InStream_Length(instream);
    self->buf      = InStream_Buf(instream, len); /* map whole file */
    self->limit    = self->buf + len;
    self->instream = REFCOUNT_INC(instream);
    return self;
  }

Such code would work fine for small files on 32-bit systems.  Large files,
however, would cause such systems to blow up, either by exceeding addressable
space and causing mmap() to fail, or, for systems without mmap(), through
excessive memory consumption.
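
To make the memory cost of the non-mmap fallback concrete, here is a minimal
sketch of a read-based window. The ReadWindow struct and ReadWindow_fill() are
hypothetical names, not the KinoSearch implementation. The point is that the
buffer has to grow to the size of the requested window, so asking for a whole
large file means allocating that many bytes.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical fallback window backed by a malloc'd buffer. */
typedef struct {
    char   *buf;  /* heap buffer holding the current window */
    size_t  cap;  /* allocated size of buf */
    size_t  len;  /* bytes currently valid in buf */
} ReadWindow;

/* Fill the window with len bytes starting at offset.  The buffer grows to
 * fit the request, which is why a whole-file window is expensive here.
 * Returns 0 on success, -1 on failure. */
static int
ReadWindow_fill(ReadWindow *win, FILE *fh, long offset, size_t len)
{
    if (len > win->cap) {
        char *grown = (char*)realloc(win->buf, len);
        if (grown == NULL) { return -1; }
        win->buf = grown;
        win->cap = len;
    }
    if (fseek(fh, offset, SEEK_SET) != 0) { return -1; }
    win->len = fread(win->buf, 1, len, fh);
    return win->len == len ? 0 : -1;
}
```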

To be portable to 32-bit systems, core modules will have to avoid mapping
large files.  If we want to max out the performance of PostingLists and
Lexicons on 64-bit systems, that means we'll have to accept the increased
maintenance burden of providing two different behaviors.  I don't think the
burden will be too heavy, though.  
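
One way the "two different behaviors" might be selected is by pointer width.
The sketch below is an assumption for illustration only: window_len_for() and
the 1 MiB WINDOW_CAP are made-up names and values, not anything from the
KinoSearch source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap on window size for 32-bit builds (1 MiB). */
#define WINDOW_CAP ((int64_t)1 << 20)

/* Decide how many bytes to request from the stream.  With 8-byte pointers
 * the whole file is requested; otherwise the request is capped so that
 * large files are never mapped in one piece. */
static int64_t
window_len_for(int64_t file_len, size_t pointer_size)
{
    if (pointer_size >= 8) { return file_len; }
    return file_len < WINDOW_CAP ? file_len : WINDOW_CAP;
}
```

A caller would pass something like
window_len_for(InStream_Length(instream), sizeof(void*)) and then iterate over
the file window by window when the result is shorter than the file.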

Marvin Humphrey


Re: Optimizing InStream for mmap

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Nov 17, 2008 at 09:56:59AM -0800, Marvin Humphrey wrote:
> I'd really like to have zero cache loads on Searcher startup, which means
> changing how the Lexicon indexes work.

Discussion of lexicon format today at:
<https://issues.apache.org/jira/browse/LUCENE-1458>.

Marvin Humphrey


Re: Optimizing InStream for mmap

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Nov 16, 2008 at 08:38:09PM -0600, Peter Karman wrote:
> How big an issue is 32-bit support? 

I don't think it's that big a deal.

There are 4 main index reading components in core.  DocReader and
TermVectorsReader are very similar to each other, and neither would benefit
significantly from refactoring for mmap, either in terms of simplicity or
performance.  Each uses two files per segment: a data stream and an index
stream.  There's no point in mapping the data stream.  Since the index stream
is just a stack of 64-bit integers, we could map it and access it as an array
of i64_t, but I can't imagine we'd see a measurable performance gain.  Even if
we did, it would only be on 32-bit systems, because on 64-bit systems the
InStream will be mapping the whole file anyway internally and the overhead of
accessing the mapped file as a stream (using Seek, Read_U64, etc) wouldn't be
significant.
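
For illustration, array-style access to a mapped index stream might look like
the sketch below. index_entry() is a hypothetical helper, and big-endian
storage is an assumption made for the example; the message above does not say
which byte order the index stream uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the 64-bit integer at position tick in a mapped index stream.
 * ASSUMPTION: entries are stored big-endian, 8 bytes each.  Decoding byte
 * by byte keeps the result independent of host byte order and alignment. */
static int64_t
index_entry(const unsigned char *buf, size_t tick)
{
    const unsigned char *p = buf + tick * 8;
    uint64_t value = 0;
    int i;
    for (i = 0; i < 8; i++) {
        value = (value << 8) | p[i];
    }
    return (int64_t)value;
}
```

This replaces a Seek followed by a Read_U64 with a single indexed lookup, which
is the kind of difference described above as unlikely to be measurable.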

That leaves the lexicons and the posting lists.  Right now, supporting
divergent code for those components doesn't look too bad, but maybe that will
change if we can start dreaming up ways to exploit the mapped files.

I'd really like to have zero cache loads on Searcher startup, which means
changing how the Lexicon indexes work.

Even better would be a SortCacheWriter component that eliminates the
substantial cost of warming sort caches by writing a mappable file at index
time.  Loading the sort cache would then be as simple as mapping the file, and
warming it for ALL forks would be as simple as "cat /path/to/index/* >
/dev/null".  However, I don't presently envision such a component as belonging
to core.

> Can you even buy a 32-bit box anymore?  I expect the ones in existence will
> be around for a few more years, but when someone like Apple drops support
> for them in their OS, you know the end is nigh.

I think it's too early to drop support for 32-bit.  And if we did, IMO we'd
have to fail at compile-time -- it's not acceptable to have a search app work
for a while and then suddenly blow up when the index reaches a threshold size.
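
A compile-time failure of that kind could be as small as the following
preprocessor check. It is a sketch of the hypothetical "64-bit only" case, not
something proposed for the code base, and it relies on UINTPTR_MAX from
<stdint.h> as a stand-in for the size of the address space.

```c
#include <stdint.h>

/* Refuse to build where pointers are narrower than 64 bits, rather than
 * letting the application fail later when an index outgrows the address
 * space. */
#if UINTPTR_MAX < 0xFFFFFFFFFFFFFFFFu
  #error "This build requires a 64-bit address space"
#endif
```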

Marvin Humphrey


Re: Optimizing InStream for mmap

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 11/15/08 5:38 PM:

> 
> To be portable to 32-bit systems, core modules will have to avoid mapping
> large files.  If we want to max out the performance of PostingLists and
> Lexicons on 64-bit systems, that means we'll have to accept the increased
> maintenance burden of providing two different behaviors.  I don't think the
> burden will be too heavy, though.  

great work on the mmap front, Marvin.

How big an issue is 32-bit support? Can you even buy a 32-bit box anymore? I
expect the ones in existence will be around for a few more years, but when
someone like Apple drops support for them in their OS, you know the end is nigh.

my 2c.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com