You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucy.apache.org by Nathan Kurz <na...@verse.com> on 2008/10/02 20:30:28 UTC

Re: PForDelta compression

On Tue, Sep 30, 2008 at 3:09 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> The conclusion that Mike McCandless and I reached, now reinforced by recent
> benchmarking, is that we need to break up that data into separate streams.

This seems like a good idea in general.  The biggest win is going to
come from avoiding per-int function overhead, but it looks like there
should be a significant advantage from PForDelta (or something custom
based on the same principles).

> In a previous post, I mentioned that tricky part about transitioning the
> current KS code base to PForDelta is that PForDelta uses block encoding.  I
> think what we'll have to do is store both the file position where the block
> starts and the offset into the block in the TermInfo we get when looking up the
> term in a Lexicon.

I'm a little concerned that the overhead of doing this is going to be
greater than the cost of  VByte decompression.   I'd wonder if
initially doing only the position stream with PForDelta makes more
sense.

More generally, this reinforces my feeling that the the Scorers need
to be fully walled off from the Index format.  I'm still of the
feeling that passing a Posting to be filled will lead to a cleaner and
faster solution than exposing the Stream to the outside.

> PS: For the lucy-dev mail archives, the paper comparing PForDelta to VByte and
> other schemes is at <http://www2008.org/papers/pdf/p387-zhangA.pdf>, and the
> paper which describes PForDelta in depth is at
> <http://homepages.cwi.nl/~heman/downloads/msthesis.pdf>.

Thanks!  I hadn't seen the thesis paper before.  Glancing over it, I
noticed that the section 3.2.2 "Minimize Memory Traffic" gave a
clearer explanation than I did of why the bulk read of the postings
might not be a win.

Nathan Kurz
nate@verse.com