You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2008/10/03 19:49:01 UTC

Re: Posting codecs

On Sep 29, 2008, at 1:47 PM, Nathan Kurz wrote:

> I would try very hard to avoid doing buffer bounds checking.  

The only place where we're really going to care about the performance issue of
buffer bounds checking is in the posting lists, since that's where the rubber
hits the road at search-time.

The first goal should be to avoid per-int or per-byte bounds checking, as
happens right now in InStream_Read_C32().  (Later, we'll go further and
minimize processor decision branching using PForDelta.)

I said before that we couldn't know the know the maximum length of a
ScorePosting in its current implementation, but we can sort of know it in
stages.  Before we start reading, we know that the maximum space that the doc
num, freq, and boost at the top can take up is 11 bytes:

  doc num   => C32, max 5 bytes
  freq      => C32, max 5 bytes
  boost     => 1 byte

Once we know freq, we know the maximum number of bytes that the positions could
occupy.

  positions => freq * C32_MAX_BYTES

To take advantage of this knowledge, we can add an argument to our buffer getter.

  /** Get the InStream's buffer.  Check to see whether <code>request</code> 
   * bytes are already in the buffer.  If not, fill the buffer with either 
   * <code>request</code> bytes or the number of bytes remaining before EOF, 
   * whichever is smaller.
   * 
   * @param request Advisory byte size request.
   * @return Pointer to the InStream's internal buffer.
   */
  char*
  Buf(InStream *self, size_t request);

Here's an alternate implementation of ScorePosting's read method.  (Note that
the DECODE_C32 macro from MathUtils advances the pointer supplied to it.)

void
ScorePost_read_record(ScorePosting *self, InStream *instream)
{
    u32_t  num_prox;
    u32_t  position = 0; 
    u32_t *positions;
    const  size_t max_start_bytes = (C32_MAX_BYTES * 2) + 1;
    char  *buf = InStream_Buf(self->instream, max_start_bytes);
    const u32_t doc_code = DECODE_32(buf);
    const u32_t doc_delta = doc_code >> 1;

    /* Apply delta doc and retrieve freq. */
    self->doc_num  += doc_delta;
    if (doc_code & 1) 
        self->freq = 1;
    else
        self->freq = DECODE_32(buf);

    /* Decode boost/norm byte. */
    self->weight = self->sim->norm_decoder[ *(u8_t*)buf ];
    buf++;
    
    /* Read positions. */
    num_prox = self->freq;
    buf = InStream_Buf(self->instream, num_prox * C32_MAX_BYTES);
    if (num_prox > self->prox_cap) {
        self->prox = REALLOCATE(self->prox, num_prox, u32_t);
        self->prox_cap = num_prox;
    }
    positions = self->prox;

    while (num_prox--) {
        position += DECODE_32(buf);
        *positions++ = position;
    }

    InStream_Advance_Buf(self->instream, buf);
}

The last method invoked, InStream_Advance_Buf(), is a setter which checks to
see whether the supplied pointer indicates a "read past EOF" error.

We're going to be changing the way that Postings are read, but the principle of
splicing in periodic buffer checking instead of needing it for every int read
still applies.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: Posting codecs

Posted by Nathan Kurz <na...@verse.com>.

On Fri, Oct 3, 2008 at 11:49 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
> Here's an alternate implementation of ScorePosting's read method.  (Note that
> the DECODE_C32 macro from MathUtils advances the pointer supplied to it.)

The code looks like it should give significantly better performance
than the current approach.  And yes, a guarantee to be able to read to
the end of single posting should give most of the performance
benefits.

>    while (num_prox--) {
>        position += DECODE_32(buf);
>        *positions++ = position;
>    }

I think there are going to be better ways of doing this once we don't
have to check buffer bounds.  I mentioned earlier that was looking at
a branchless way to decode VBytes. I don't have working code yet, and
it's not going to be truly branchless, but I'm pretty certain there is
room for big speedup by treating the VBytes in a byte-aligned manner.

The basic algorithm is to read several bytes, make a mask byte from
the high bits that are set, and then use the mask as a switch into a
jump table to handle the pattern of bytes encoded.  It should make it
as fast as the S9/S16 code described in the PForDelta paper, and I
think we can make it even faster with a little MMX/SSE optimization.

Thus the loop above would be replaced with something like:
vbyte_read_block(&buf, num_prox, &positions);
Perhaps you could convert the loop into an inline function of this
signature to make a switch like this easier in the future?

> (Later, we'll go further and minimize processor decision branching using
> PForDelta.)

What is your anticipated path for doing this?  Would you subclass
ScorePost as VByteScorePost and PForDeltaScorePost?  And what would
the mechanism be to ensure that  the correct one is used?

Nathan Kurz
nate@verse.com