Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2010/01/30 19:11:22 UTC

SortCache on a 32-bit OS

Greets,

As discussed previously and prototyped in KinoSearch, sort caches consist of
either 2 or 3 "files" (actually, virtual files within the compound file).  

Variable width types:

    .ord
    .ix
    .dat

Fixed width types:

    .ord
    .dat

In the prototype implementation, all of these files get memory-mapped at
SortCache object construction time.  However, for very large indexes, this
poses a problem on 32-bit operating systems.
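
For illustration, here's a toy sketch of how those three files relate for a
variable width type -- not the actual KinoSearch/Lucy on-disk format, just the
idea: .ord maps doc id to ordinal, .ix maps ordinal to a byte offset, and .dat
holds the encoded values back to back.

    /* Toy sketch only -- struct layout, types, and names are illustrative. */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        const int32_t *ords;   /* mmap'd .ord: one ordinal per document    */
        const int64_t *ix;     /* mmap'd .ix:  one byte offset per ordinal */
        const char    *dat;    /* mmap'd .dat: concatenated field values   */
    } ToySortCache;

    /* Within a segment, comparing two docs only needs their ordinals. */
    static int
    toy_compare(const ToySortCache *cache, int32_t doc_a, int32_t doc_b) {
        int32_t ord_a = cache->ords[doc_a];
        int32_t ord_b = cache->ords[doc_b];
        return (ord_a > ord_b) - (ord_a < ord_b);
    }

    /* Recovering a doc's field value goes through .ix into .dat. */
    static const char*
    toy_value(const ToySortCache *cache, int32_t doc, size_t *len) {
        int32_t ord   = cache->ords[doc];
        int64_t start = cache->ix[ord];
        int64_t end   = cache->ix[ord + 1];  /* next offset bounds the value */
        *len = (size_t)(end - start);
        return cache->dat + start;
    }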

Multi-gigabyte indexes work fine on a 32-bit OS, provided that you actually
have enough RAM to keep the whole index RAM-resident in the system IO cache.
However, with enough sort fields and enough unique values, you can exceed
the 32-bit address space limitation under the current design.

To solve this problem, I think we ought to mmap the ords file only and use
sequential reads to recover values -- the same way we do with our lexicons.
The price will be slightly increased CPU costs under a couple of
circumstances:

  * Looking up sortable values at the close of each segment when matching.
  * Finding the ord of a term when preparing a range filter for each segment.

The increased CPU costs come from extra seeks, memory maps, and memory copies.
Right now for TextSortCache objects, we use ViewCharBufs and just assign
pointers within the memory mapped region.  We would have to change to real
CharBufs and perform memory copies, since the mapped region would no longer be
mapped for the life of the parent SortCache -- but that's more robust anyway.
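
To make the proposal concrete, here's a rough sketch of the new scheme, with
plain stdio streams standing in for InStreams and a malloc'd buffer standing
in for a CharBuf: only .ord stays memory-mapped, and value recovery becomes
seek, read, and copy.

    /* Rough sketch of the proposal -- file handles, helper names, and error
     * handling are illustrative, not the code that would go into KS. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        const int32_t *ords;   /* still memory-mapped              */
        FILE          *ix;     /* .ix read through ordinary seeks  */
        FILE          *dat;    /* .dat read through ordinary seeks */
    } ToySortCacheReader;

    /* Copy the value for `doc` into a buffer the caller owns. */
    static char*
    toy_value_copy(ToySortCacheReader *r, int32_t doc, size_t *len) {
        int32_t ord = r->ords[doc];
        int64_t offsets[2];

        /* Read the start/end offsets for this ordinal from .ix. */
        fseek(r->ix, (long)(ord * sizeof(int64_t)), SEEK_SET);
        fread(offsets, sizeof(int64_t), 2, r->ix);
        *len = (size_t)(offsets[1] - offsets[0]);

        /* Seek into .dat and copy the bytes out -- the copy is the new cost. */
        char *value = malloc(*len + 1);
        fseek(r->dat, (long)offsets[0], SEEK_SET);
        fread(value, 1, *len, r->dat);
        value[*len] = '\0';
        return value;
    }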

I believe that with this plan we can push the index size at which address
space runs out beyond the practical size for a single machine -- even when
you're doing something silly like running a 32-bit OS on a box with 16 gigs of
RAM.

Marvin Humphrey


Re: [Lucy] Re: SortCache on a 32-bit OS

Posted by Peter Karman <pe...@peknet.com>.
Nathan Kurz wrote on 01/30/2010 05:22 PM:

>> I should specify that the extra calls to mmap() and munmap() occur on 32-bit
>> systems only.  For 64-bit systems, we mmap() the whole compound file the
>> instant it gets opened, and InStream_Buf() is just a thin wrapper around some
>> pointer math.
> 
> I had not realized that.  This softens my position considerably.  I'm
> all for increasing legacy performance so long as it doesn't
> complicate the mainline architecture.
> 

++

This is an interesting thread, though I have nothing interesting to add. :)

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: SortCache on a 32-bit OS

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sat, Jan 30, 2010 at 03:22:51PM -0800, Nathan Kurz wrote:
> But isn't this true of sort caches as well?  They don't
> cross segments, do they?

Correct.  Sort Caches are per-segment.  In order to compare documents across
segments, we need to perform comparisons of recovered field values -- we
cannot use ords.  

Under this change (which went into KS as r5787 and r5789), we still mmap ords,
so comparisons *within* each segment, where most of the work gets done,
haven't changed.  It's just the recovery of the field values that changes.
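
A toy illustration of the difference (function names invented): within a
segment the ordinals are directly comparable, while across segments you have
to fall back to comparing the recovered values themselves.

    #include <stdint.h>
    #include <string.h>

    /* Within one segment: cheap, ordinal-only comparison. */
    static int
    within_segment_cmp(int32_t ord_a, int32_t ord_b) {
        return (ord_a > ord_b) - (ord_a < ord_b);
    }

    /* Across segments: each segment numbers its ordinals independently,
     * so we compare the recovered field values instead. */
    static int
    cross_segment_cmp(const char *value_a, const char *value_b) {
        return strcmp(value_a, value_b);
    }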

> OK, but you can pretty well catch this at index creation time, can't you?  

Not on indexes which grow incrementally in response to user input.  For
instance, say you have an index of user comments which grows every time
somebody leaves a comment.  Depending on how you use that index, it could be
fine for months or years, then suddenly, boom!  Out of address space.

> And even failing at run time with a clear error (mmap failed:
> too large to map) might be preferable to the sticky morass of a
> steeply declining performance curve once you start to swap.

If the error could only occur in an offline process, then I'd agree.  But for
an app that takes live updates, it's the opposite. 

> I'm all for increasing legacy performance so long as it doesn't
> complicate the mainline architecture.

For what it's worth, the code got marginally simpler with this change -- some
method calls which had been unwrapped to deal with the raw mapped memory went
back to being method calls. :)

> > Well, that sort of sharding is not within the scope of Lucy itself.  It's a
> > Solr-level solution.
> 
> Remind me again:  what's the difference between multiple segments and
> sequential sharding?  

Sharding in this context means multiple indexes read by multiple processes,
optionally on multiple machines, with results joined together in a boss
thread. 

By breaking the index reads into multiple processes, we can divide down the
memory mapping requirements.  However, the indexes no longer move in lockstep,
since there can be multiple writers and multiple locks. 
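
As a sketch of what the boss side might look like (names invented, and a real
implementation would do a k-way merge rather than a full sort), the per-shard
hit lists just get collated into one ranked list:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct { int32_t doc_id; int32_t shard; float score; } Hit;

    static int
    hit_cmp(const void *va, const void *vb) {
        const Hit *a = va;
        const Hit *b = vb;
        if (a->score < b->score) { return 1; }   /* descending by score */
        if (a->score > b->score) { return -1; }
        return 0;
    }

    /* Collate hits gathered from all shards and keep the top `wanted`. */
    static size_t
    merge_shard_hits(Hit *hits, size_t num_hits, size_t wanted) {
        qsort(hits, num_hits, sizeof(Hit), hit_cmp);
        return num_hits < wanted ? num_hits : wanted;
    }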

> And if you take that world-view, what stops you from processing segments in
> parallel rather than sequentially? :)

We'll get to that.  :)

> Yes, you probably don't want to do all the cross-machine process management,
> but designing the architecture so that it's possible to aggregate and sort
> results from multiple queries seems well within bounds.

That's more or less in place, since we have an aggregator class, PolySearcher,
which can be used to collate results from multiple Searchers.  However,
managing state for fast-moving remote indexes is a problem -- it's hard to
ensure that an internal doc_id references the same document when you go to
fetch it that it referenced while you were running the Matcher.

Marvin Humphrey


Re: SortCache on a 32-bit OS

Posted by Nathan Kurz <na...@verse.com>.
On Sat, Jan 30, 2010 at 1:15 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Sat, Jan 30, 2010 at 12:11:41PM -0800, Nathan Kurz wrote:
>
>> The window where this choice is beneficial is small:  something like
>> 32-bit systems using 2-4 Gig indexes with multiple sortable fields
>> with unique values.   Unless this is the use case that Eventful needs,
>
> Well, actually... yes, it is.

Then you should do it!  As long as you are designing it around a real
need, it will probably be a good design choice.

> Indexes can actually grow larger than 2-4 GB on such systems and still
> maintain top performance.  Because 32-bit operating systems can exploit the
> full RAM on a machine and use it for system IO cache, you can have indexes
> over 4 GB that stay fully RAM-resident.

Definitely right, but I'm most interested in cases that allow
searching for full quotes, hence no stop words.  In my mind, once you
can't map in positions for the word 'the', you're done.   The obvious
answer to this is that it's segment size, rather than index size, that
matters here.  But isn't this true of sort caches as well?  They don't
cross segments, do they?

> The problem with running out of address space is that there's no warning
> before catastrophic failure, and then no possibility of recovery short of
> rearchitecting your search infrastructure or installing a new operating
> system.  It's a really serious glitch to hit.  It would suck if Eventful hit
> it, but I really don't want anybody else to hit it either.

OK, but you can pretty well catch this at index creation time, can't
you?  And even failing at run time with a clear error (mmap failed:
too large to map) might be preferable to the sticky morass of a
steeply declining performance curve once you start to swap.

> I should specify that the extra calls to mmap() and munmap() occur on 32-bit
> systems only.  For 64-bit systems, we mmap() the whole compound file the
> instant it gets opened, and InStream_Buf() is just a thin wrapper around some
> pointer math.

I had not realized that.  This softens my position considerably.  I'm
all for increasing legacy performance so long as it doesn't
complicate the mainline architecture.

>> Sure, these systems will exist, but solve the problem in a way that benefits
>> everyone:  shard it!
>
> Well, that sort of sharding is not within the scope of Lucy itself.  It's a
> Solr-level solution.

Remind me again:  what's the difference between multiple segments and
sequential sharding?  And if you take that world-view, what stops you
from processing segments in parallel rather than sequentially? :)
Yes, you probably don't want to do all the cross-machine process
management, but designing the architecture so that it's possible to
aggregate and  sort results from multiple queries seems well within
bounds.

Nathan Kurz
nate@verse.com

Re: SortCache on a 32-bit OS

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sat, Jan 30, 2010 at 12:11:41PM -0800, Nathan Kurz wrote:

> The window where this choice is beneficial is small:  something like
> 32-bit systems using 2-4 Gig indexes with multiple sortable fields
> with unique values.   Unless this is the use case that Eventful needs,

Well, actually... yes, it is.

However, I don't think that's a bad thing in this case.  Everyone at Eventful
involved with Lucy and KinoSearch agrees that we should not hammer misfeatures
into open source code.  This is a position of enlightened self-interest.  If
we were to do such a thing and I ever left Eventful, all of a sudden Eventful
would be counting on misfeatures without having sway over a project leader
willing to abuse his position to keep those misfeatures jammed in there.
Eventful is much more interested in sponsoring *good* work that they can count
on regardless of whether I continue to be employed by them.  

Eventful's need for fast index reopens was what drove the accelerated
development of mmap'd sort caches in the first place.  It got priority bumped
because of who pays my salary, but it's a great feature, regardless.  What I'm
proposing now is also useful for Lucy at large.  It's not a hack.  If I'm
gonna write hacks for Eventful, they'll stay in private code.

Indexes can actually grow larger than 2-4 GB on such systems and still
maintain top performance.  Because 32-bit operating systems can exploit the
full RAM on a machine and use it for system IO cache, you can have indexes
over 4 GB that stay fully RAM-resident.

> Would more than a handful of people benefit from this?  

Yes, I believe so -- but more to the point, the people who benefit from this
benefit greatly.

The problem with running out of address space is that there's no warning
before catastrophic failure, and then no possibility of recovery short of
rearchitecting your search infrastructure or installing a new operating
system.  It's a really serious glitch to hit.  It would suck if Eventful hit
it, but I really don't want anybody else to hit it either.  

When I worked out the windowing system that provides Lucy with 32-bit
compatibility, I thought I had solved this problem and that no one would ever
hit the out-of-address-space limitation.  It was only when we actually built
some gigantic indexes -- that happen to perform great because they don't ever
need to iterate large posting lists -- that I realized the current
implementation of SortCache might pose a problem.

> > The increased CPU costs come from extra seeks, memory maps, and memory copies.
> 
> In general, burning CPU instructions is no problem, but consuming
> excess memory IO bandwidth should be stringently avoided.  

I should specify that the extra calls to mmap() and munmap() occur on 32-bit
systems only.  For 64-bit systems, we mmap() the whole compound file the
instant it gets opened, and InStream_Buf() is just a thin wrapper around some
pointer math.  

The seeks, likewise, are not system calls on 64-bit systems -- they're just a
method call and pointer math.
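
For a rough picture of what that means -- a standalone sketch, not Lucy's
actual InStream code -- a buffer request that falls inside the current map is
just pointer arithmetic, while on a 32-bit build a request past the window's
edge forces an munmap()/mmap() pair:

    #include <stdint.h>
    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    typedef struct {
        int      fd;
        char    *window;      /* currently mapped region             */
        int64_t  win_start;   /* file offset where the window begins */
        int64_t  win_len;     /* length of the mapped window         */
    } ToyInStream;

    static const char*
    toy_buf(ToyInStream *in, int64_t offset, int64_t request) {
        if (offset >= in->win_start
            && offset + request <= in->win_start + in->win_len) {
            /* The 64-bit case in practice: the whole file is mapped, so
             * every request lands here and costs only pointer math. */
            return in->window + (offset - in->win_start);
        }
        /* The 32-bit case: slide the window.  (A real implementation
         * would page-align the offset and check for errors.) */
        munmap(in->window, (size_t)in->win_len);
        in->win_start = offset;
        in->window    = mmap(NULL, (size_t)in->win_len, PROT_READ,
                             MAP_SHARED, in->fd, (off_t)offset);
        return in->window;
    }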

The only real downside is the cost of copying text data rather than copying
pointers.  But even then, we have to read over all character data with each
call to TextSortCache_Value() anyway, because we have to perform a UTF-8
sanity check.  Plus, the cost scales with the number of segments rather than
the number of documents.  Plus, CharBufs are easier to deal with than
ViewCharBufs because you don't have to worry about the parent object
disappearing.
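
For reference, the UTF-8 sanity check amounts to a scan like the following
simplified sketch (not Lucy's actual validator -- overlong forms and
surrogates are ignored here), which already touches every byte of the value,
so the added copy into a CharBuf is cheap by comparison:

    #include <stdbool.h>
    #include <stddef.h>

    static bool
    looks_like_utf8(const unsigned char *s, size_t len) {
        size_t i = 0;
        while (i < len) {
            unsigned char head = s[i];
            size_t extra = head < 0x80           ? 0
                         : (head & 0xE0) == 0xC0 ? 1
                         : (head & 0xF0) == 0xE0 ? 2
                         : (head & 0xF8) == 0xF0 ? 3
                         : (size_t)-1;
            if (extra == (size_t)-1 || i + extra >= len) { return false; }
            for (size_t k = 1; k <= extra; k++) {
                /* Every trailing byte must be a continuation byte. */
                if ((s[i + k] & 0xC0) != 0x80) { return false; }
            }
            i += extra + 1;
        }
        return true;
    }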

> > I believe that with this plan we can push the index size at which address
> > space runs out beyond the practical size for a single machine -- even when
> > you're doing something silly like running a 32-bit OS on a box with 16 gigs of
> > RAM.
> 
> Sure, these systems will exist, but solve the problem in a way that benefits
> everyone:  shard it!  

Well, that sort of sharding is not within the scope of Lucy itself.  It's a
Solr-level solution.

> Instead of trying to cram a 16 GB index into a 3GB process address space
> through inefficient tricks, run 8 2GB shards on the same machine, or better
> yet across 2 machines.  Then there is no hard limit at max RAM, instead you
> just add another machine.

Heh.  I couldn't agree more.  Believe me.

Marvin Humphrey



Re: SortCache on a 32-bit OS

Posted by Nathan Kurz <na...@verse.com>.
On Sat, Jan 30, 2010 at 10:11 AM, Marvin Humphrey
<ma...@rectangular.com> wrote:
> To solve this problem, I think we ought to mmap the ords file only and use
> sequential reads to recover values -- the same way we do with our lexicons.
> The price will be slightly increased CPU costs under a couple of
> circumstances:

In practice, this is probably fine, but I'm not convinced it's a good
path to follow.

I think you are better off declaring that a 64-bit OS is the target
platform for large indexes, and making optimizations for this platform
even if it reduces utility for those who choose to use a 32-bit
system.   I think you'll end up with simpler code, and better
performance.

The window where this choice is beneficial is small:  something like
32-bit systems using 2-4 Gig indexes with multiple sortable fields
with unique values.   Unless this is the use case that Eventful needs,
I don't think it's worth optimizing.    Would more than a handful of
people benefit from this?  Is this worth the significantly reduced
performance on a truly gigantic index?

> The increased CPU costs come from extra seeks, memory maps, and memory copies.

In general, burning CPU instructions is no problem, but consuming
excess memory IO bandwidth should be stringently avoided.  Look to the
multicore future:  you're going to have more and more processors
fighting for access to the same pool of RAM.  Reduce this contention
however you can, target the systems people will be running in 3-5
years, and you'll fly.

> I believe that with this plan we can push the index size at which address
> space runs out beyond the practical size for a single machine -- even when
> you're doing something silly like running a 32-bit OS on a box with 16 gigs of
> RAM.

Sure, these systems will exist, but solve the problem in a way that
benefits everyone:  shard it!  Instead of trying to cram a 16 GB index
into a 3GB process address space through inefficient tricks, run 8 2GB
shards on the same machine, or better yet across 2 machines.  Then
there is no hard limit at max RAM, instead you just add another
machine.

32-bit machines will run more small shards (varying in size based on
the address space required), while 64-bit machines will run fewer large ones
(limited by system RAM and number of cores).  But once you solve this
problem efficiently, you scale to the moon.

Nathan Kurz
nate@verse.com