Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2010/02/01 21:13:27 UTC

Re: SortCache on a 32-bit OS

On Sat, Jan 30, 2010 at 03:22:51PM -0800, Nathan Kurz wrote:
> But isn't this true of sort caches as well?  They don't
> cross segments, do they?

Correct.  Sort caches are per-segment.  To compare documents across segments,
we have to compare recovered field values -- we can't use ords.
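To illustrate the distinction, here's a minimal sketch (names and data layout
are mine, not Lucy's): within a segment, ords are segment-local sort ranks and
compare as plain integers; across segments the ords mean nothing to each
other, so the comparison has to go through the recovered values.

```python
def compare_within_segment(ords, doc_a, doc_b):
    # Ords are segment-local ranks: a cheap integer comparison suffices.
    return ords[doc_a] - ords[doc_b]

def compare_across_segments(seg_a_values, doc_a, seg_b_values, doc_b):
    # Ords from different segments are not comparable, so we fall back
    # to recovering and comparing the field values themselves.
    val_a = seg_a_values[doc_a]
    val_b = seg_b_values[doc_b]
    return (val_a > val_b) - (val_a < val_b)
```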

Under this change (which went into KS as r5787 and r5789), we still mmap ords,
so comparisons *within* each segment, where most of the work gets done,
haven't changed.  It's just the recovery of the field values that changes.
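A rough sketch of the shape of that change (the class and file layout here are
assumptions for illustration, not KS's actual format): the ords array stays
memory-mapped, while value recovery goes through a method call instead of
dereferencing raw mapped memory.

```python
import mmap
import struct

class SortCacheSketch:
    """Hypothetical sort cache: mmap'd ords, method-call value recovery."""

    def __init__(self, ords_path, values):
        self._file = open(ords_path, "rb")
        # Ords stay memory-mapped: within-segment comparisons read them
        # directly, as before.
        self._ords = mmap.mmap(self._file.fileno(), 0,
                               access=mmap.ACCESS_READ)
        self._values = values  # stand-in for on-disk value storage

    def ord(self, doc_id):
        # Assumed layout: one little-endian int32 ord per document.
        return struct.unpack_from("<i", self._ords, doc_id * 4)[0]

    def value(self, doc_id):
        # Field-value recovery is now a method call rather than a read
        # of raw mapped memory.
        return self._values[self.ord(doc_id)]
```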

> OK, but you can pretty well catch this at index creation time, can't you?  

Not on indexes which grow incrementally in response to user input.  For
instance, say you have an index of user comments which grows every time
somebody leaves a comment.  Depending on how you use that index, it could be
fine for months or years, then suddenly, boom!  Out of address space.
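The arithmetic behind "boom" is simple (all numbers below are illustrative
assumptions, not measurements): a 32-bit process has 4 GB of address space,
of which perhaps 2-3 GB is actually usable for mappings, so a steadily
growing set of mapped files eventually fails to fit.

```python
# Conservative assumption: ~2 GB of address space usable for mappings
# in a 32-bit process.
USABLE_ADDRESS_SPACE = 2 * 1024**3

def can_map(segment_sizes):
    # An incrementally growing index eventually fails this check,
    # regardless of how healthy it looked at creation time.
    return sum(segment_sizes) <= USABLE_ADDRESS_SPACE
```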

> And even failing at run time with a clear error (mmap failed:
> too large to map) might be preferable to the sticky morass of a
> steeply declining performance curve once you start to swap.

If the error could only occur in an offline process, then I'd agree.  But for
an app that takes live updates, it's the opposite. 

> I'm all for increasing legacy performance so long as it doesn't
> complicate the mainline architecture.

For what it's worth, the code got marginally simpler with this change -- some
method calls which had been unwrapped to deal with the raw mapped memory went
back to being method calls. :)

> > Well, that sort of sharding is not within the scope of Lucy itself.  It's a
> > Solr-level solution.
> 
> Remind me again:  what's the difference between multiple segments and
> sequential sharding?  

Sharding in this context means multiple indexes read by multiple processes,
optionally on multiple machines, with results joined together in a boss
thread. 

By breaking the index reads into multiple processes, we can divide down the
memory mapping requirements.  However, the indexes no longer move in lockstep,
since there can be multiple writers and multiple locks. 
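A scatter-gather sketch of that arrangement (threads stand in here for the
separate processes or machines; the toy "shard" and scoring are invented for
illustration): each shard returns its own top hits, and the boss merges them.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, query, top_n):
    # Each shard is a toy mapping of doc text -> score.  Shards rank
    # their own hits independently.
    hits = [(score, doc) for doc, score in shard.items() if query in doc]
    return sorted(hits, reverse=True)[:top_n]

def boss_search(shards, query, top_n):
    # Fan the query out to every shard, then merge the per-shard hit
    # lists and keep the global top N.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda s: search_shard(s, query, top_n),
                                shards))
    return heapq.nlargest(top_n, (hit for hits in results for hit in hits))
```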

> And if you take that world-view, what stops you from processing segments in
> parallel rather than sequentially? :)

We'll get to that.  :)

> Yes, you probably don't want to do all the cross-machine process management,
> but designing the architecture so that it's possible to aggregate and sort
> results from multiple queries seems well within bounds.

That's more or less in place, since we have an aggregator class, PolySearcher,
which can be used to collate results from multiple Searchers.  However,
managing state for fast-moving remote indexes is a problem -- it's hard to
ensure that an internal doc_id references the same document when you go to
fetch it that it referenced while you were running the Matcher.

Marvin Humphrey