Posted to dev@lucy.apache.org by Peter Karman <pe...@peknet.com> on 2012/08/24 20:33:00 UTC

[lucy-dev] lucene block postings format

Just ran across this:

http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html

comments on whether there is anything to be learned here for Lucy?

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] lucene block postings format

Posted by Nathan Kurz <na...@verse.com>.
A new paper on a similar topic:

http://lemire.me/blog/archives/2012/09/12/fast-integer-compression-decoding-billions-of-integers-per-second/

http://arxiv.org/pdf/1209.2137v1.pdf

"we introduce a novel vectorized scheme called SIMD-BP128 that
improves over previously proposed vectorized approaches. It is nearly
twice as fast as the previously fastest schemes on desktop processors
(varint-G8IU and PFOR). At the same time, SIMD-BP128 saves up to 2
bits per integer. For even better compression, we propose another new
vectorized scheme (SIMD-FastPFOR) that has a compression rate within
10% of a state-of-the-art scheme (Simple-8b) while being two times
faster during decoding."

I haven't read the details, but at a quick skim I think they reach the
same conclusion: that PackedBinary (FOR) is significantly faster than
PFOR, at the cost of only slightly worse compression.  But they then go
on to show that this benefit is minimal compared to the advantage of
using vectorized SIMD instructions.
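
For anyone who hasn't looked at FOR up close, the packing step is
roughly this (a scalar C sketch with invented names -- the paper's
versions are vectorized, this just shows the idea): every delta in a
block gets the same fixed bit width, which is what lets the decoder
run without branches.

    #include <stdint.h>

    /* Pack `n` deltas into `out` at `width` bits apiece (width <= 32).
     * FOR/PackedBinary: one fixed width per block, sized for the
     * largest delta, so unpacking needs no per-value branches. */
    static void
    for_pack(const uint32_t *deltas, int n, uint32_t *out, int width) {
        uint64_t acc  = 0;  /* bit accumulator */
        int      bits = 0;  /* bits currently buffered in acc */
        for (int i = 0; i < n; i++) {
            acc  |= (uint64_t)deltas[i] << bits;
            bits += width;
            while (bits >= 32) {
                *out++ = (uint32_t)acc;
                acc  >>= 32;
                bits  -= 32;
            }
        }
        if (bits > 0) { *out++ = (uint32_t)acc; }
    }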

I'm leery of their thought processes, though:
"According to our analysis, one decoding step of this algorithms would
require at least 11 assembly instructions. In contrast, one step of
our interleaved bit unpacking from 3 (used by our   SIMD-BP128) would
typically require only 5 and 10 assembly instructions in a typical and
worst case, respectively."

In a paper that is about the benefits of SIMD, an argument presuming
that fewer instructions means faster is radically out of place.  Even
without SIMD, different instructions take vastly different numbers of
cycles.  I'm tempted to think that one of the authors "gets it" and
the other doesn't.

--nate

Re: [lucy-dev] lucene block postings format

Posted by Nathan Kurz <na...@verse.com>.
On Fri, Aug 24, 2012 at 6:36 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Fri, Aug 24, 2012 at 11:33 AM, Peter Karman <pe...@peknet.com> wrote:
>> Just ran across this:
>>
>> http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html
>>
>> comments on whether there is anything to be learned here for Lucy?

I just read through the JIRA issue for this, and had trouble
distinguishing their chosen algorithm from the Java- and
Lucene-specific details.  As regards Lucy, I think the main applicable
conclusion was that using a straight "FOR" (Frame of Reference)
approach was slightly more efficient than using "PFOR" (Patched Frame
of Reference), and only slightly worse in index size.  But I'm not
sure how much this is due to the specific decoder they are using
versus anything inherent to the patching.  On the other hand, since
the simple FOR (which I think is the same as what they are calling
PackedInt) is conceptually simpler and the index sizes are not
significantly larger, this may be moot.
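
To make the FOR/PFOR distinction concrete, here's the shape of the
difference as I understand it (a C sketch with invented names, not
Lucene's code): FOR sizes the whole block for its largest delta, while
PFOR picks a smaller width and patches the outliers back in afterward.
That patch loop is the branching FOR avoids.

    #include <stdint.h>

    /* FOR: the whole block is packed at the bit width of its largest
     * delta -- simple and branch-free to decode, but one outlier
     * inflates the width for every value in the block. */
    static int
    for_width(const uint32_t *deltas, int n) {
        uint32_t max = 0;
        for (int i = 0; i < n; i++) {
            if (deltas[i] > max) { max = deltas[i]; }
        }
        int width = 0;
        while (max) { width++; max >>= 1; }
        return width;
    }

    /* PFOR: pack at a smaller width that covers most values, then
     * patch the exceptions back in after the bulk decode.  This loop
     * is the extra (mispredictable) work that FOR skips. */
    static void
    pfor_patch(uint32_t *decoded, const uint16_t *positions,
               const uint32_t *exceptions, int num_exceptions) {
        for (int i = 0; i < num_exceptions; i++) {
            decoded[positions[i]] = exceptions[i];
        }
    }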

> As far as what we might have learned from the Lucene implementation, I think
> there are two things:
>
> 1.  Block posting formats are more suitable for modern pipelining processors
>     than the bytewise compression we use now.

I'd agree.  I was interested in an observation in the commentary
guessing that they were running up against memory bandwidth (rather
than CPU) as a limiting factor.  I think this will definitely hold for
Lucy: at a certain point, improvements in decoding efficiency don't
matter if the CPU is starved for data.  Raw ints are definitely memory
bound, variable length ints are CPU bound, and the various FOR and
PFOR implementations straddle the line.  The goal is not just to
decode quickly, or even to selectively decode, but to never even fetch
the blocks you don't need.  Operating in C, I think/hope we may have
an advantage here, as we can hint to the processor about prefetching.
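
For example (a sketch: __builtin_prefetch is the GCC/Clang extension,
everything else here is invented):

    #include <stdint.h>

    void decode_block(const uint32_t *in, uint32_t *out);  /* hypothetical */

    /* While decoding block i, ask the hardware to start pulling in
     * block i + 1, so the decode loop isn't left waiting on RAM.
     * The (0, 0) arguments mean "for reading, low temporal locality". */
    static void
    decode_all(const uint32_t **blocks, int num_blocks, uint32_t *buf) {
        for (int i = 0; i < num_blocks; i++) {
            if (i + 1 < num_blocks) {
                __builtin_prefetch(blocks[i + 1], 0, 0);
            }
            decode_block(blocks[i], buf);
            /* ... match and score out of buf ... */
        }
    }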

> 2.  Pluggable posting formats are the wrong level of abstraction.
>
> If we transition to a block posting format like PFORdelta, we will save on
> decoding posting lists, but gains will be limited because we still have to
> access the decoded integers one at a time inside TermMatchers wrapped in
> expensive ORMatchers, etc.  (If you look at profiling data for our
> scoring/matching methods, the bottlenecks are not in InStream, but in higher
> level Matchers like OrMatcher etc.)
>
> Java has an advantage here because hotspot compilers enable inlining of method
> calls in a way we can't duplicate from C -- so those higher level calls may
> get streamlined away to an extent.  However, we can change our architecture so
> that it becomes possible to inline procedures manually.
>
> If we really want to tear it up, I believe we need to implement the
> "MatchEngine" level of abstraction, as proposed on this list a while back.

That's an interesting point about the advantage of a hotspot compiler
--- once you have it in place, it does give you a lot for free.  I'm
not sure that we actually need to inline our calls, though, so much as
limit the number of layers of indirection to something reasonable.  I
think we can match their performance just by being sensible, and pull
ahead by integrating more closely with the metal at the very bottom.

I wouldn't give up on pluggable posting formats yet.

I still think that there is a fine solution where we define an
interchange data format that each plugin decodes to, and each scorer
knows how to read from.  Paying particular attention to cache lines,
the plugin fetches a compressed format from memory and decodes it to
a static buffer with as few mispredicted branches as possible.  The
scorer then operates on this decoded block directly, no layers
involved.  So long as this static buffer lives in L1, the matching
runs at CPU speed rather than RAM speed, and by having a fixed format,
the scorers can be optimized in advance.  If we can index far enough
into our format to avoid unnecessary reads/decodes, I think we can
fly, no hotspot or MatchEngine required.
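
Something like this is what I have in mind (pure sketch, every name
invented -- none of this is existing Lucy API):

    #include <stdint.h>

    #define BLOCK_SIZE 128  /* 128 docs * 4 bytes per array stays
                             * comfortably inside L1 */

    /* The fixed interchange format: every postings plugin decodes
     * into one of these, and every scorer knows how to read it. */
    typedef struct {
        uint32_t doc_ids[BLOCK_SIZE];  /* deltas already resolved */
        uint32_t freqs[BLOCK_SIZE];
        uint32_t count;                /* valid entries in this block */
    } DecodedBlock;

    /* A scorer inner loop over the static buffer: no method calls,
     * no layers, just a linear scan over L1-resident data. */
    static float
    score_block(const DecodedBlock *block, float weight) {
        float sum = 0.0f;
        for (uint32_t i = 0; i < block->count; i++) {
            sum += weight * (float)block->freqs[i];
        }
        return sum;
    }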

--nate

Re: [lucy-dev] lucene block postings format

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Aug 24, 2012 at 11:33 AM, Peter Karman <pe...@peknet.com> wrote:
> Just ran across this:
>
> http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html
>
> comments on whether there is anything to be learned here for Lucy?

FWIW: it's not public, but we have a pluggable postings API in Lucy which
would allow us to prototype a block posting format such as PFORdelta.

As far as what we might have learned from the Lucene implementation, I think
there are two things:

1.  Block posting formats are more suitable for modern pipelining processors
    than the bytewise compression we use now.
2.  Pluggable posting formats are the wrong level of abstraction.

If we transition to a block posting format like PFORdelta, we will save on
decoding posting lists, but gains will be limited because we still have to
access the decoded integers one at a time inside TermMatchers wrapped in
expensive ORMatchers, etc.  (If you look at profiling data for our
scoring/matching methods, the bottlenecks are not in InStream, but in higher
level Matchers like OrMatcher etc.)

Java has an advantage here because hotspot compilers enable inlining of method
calls in a way we can't duplicate from C -- so those higher level calls may
get streamlined away to an extent.  However, we can change our architecture so
that it becomes possible to inline procedures manually.
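
For instance (a sketch, not current Lucy code): hot accessors could
move into headers as static inline functions, giving the C compiler
the same visibility at each call site that hotspot gets at runtime.

    #include <stdint.h>

    typedef struct {           /* hypothetical posting list state */
        const uint32_t *doc_ids;
        uint32_t        pos;
    } PostingList;

    /* Defined in the header, so the compiler can flatten this into
     * every caller -- unlike an out-of-line function or a vtable
     * dispatch, which it can't see through. */
    static inline uint32_t
    PList_Next_Doc(PostingList *plist) {
        return plist->doc_ids[plist->pos++];
    }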

If we really want to tear it up, I believe we need to implement the
"MatchEngine" level of abstraction, as proposed on this list a while back.

     my $query    = $queryparser->parse($query_string);
-    my $compiler = $query->make_compiler(searcher => $searcher);
+    my $compiler = $match_engine->make_compiler(
+        query    => $query,
+        searcher => $searcher,
+    );

MatchEngine abstracts the entire matching and scoring engine.  If we make a
MatchEngine implementation a closed system, then all potential Matcher classes
become known, and we can build specialized loop code.

    if (Matcher_Get_VTable(sub_matcher) == PHRASEMATCHER) {
        /* tight loop optimized for PhraseMatcher goes here */
    }
    else if (Matcher_Get_VTable(sub_matcher) == TERMMATCHER) {
        /* tight loop optimized for TermMatcher goes here */
    }
    ...

If we are able to discover that e.g. an ORMatcher consists of only two
TermMatcher children, we can reach in and grab their posting lists and
access the integer blocks directly, rather than having to go through an
OO interface that is flexible but comparatively expensive.
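
Concretely, once we know both children are TermMatchers, the OR loop
could collapse to a plain sorted-list union over their decoded doc-id
blocks (a sketch with invented names -- no virtual Next()/Advance()
calls in sight):

    #include <stdint.h>

    /* With both children known to be TermMatchers, skip the OO
     * interface and merge their decoded doc-id blocks directly:
     * a sorted-list union in place of per-hit virtual dispatch. */
    static void
    or_two_terms(const uint32_t *a, uint32_t a_len,
                 const uint32_t *b, uint32_t b_len,
                 void (*collect)(uint32_t doc_id)) {
        uint32_t i = 0, j = 0;
        while (i < a_len && j < b_len) {
            if      (a[i] < b[j]) { collect(a[i++]); }
            else if (b[j] < a[i]) { collect(b[j++]); }
            else                  { collect(a[i]); i++; j++; }
        }
        while (i < a_len) { collect(a[i++]); }
        while (j < b_len) { collect(b[j++]); }
    }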

At that point, the gains yielded by block posting formats would loom larger.

Marvin Humphrey