Posted to dev@lucene.apache.org by Michael McCandless <lu...@mikemccandless.com> on 2021/05/03 17:30:14 UTC

Re: Exploring a different approach to skip lists

+1 to explore innovations in Lucene's skip-list implementation!

Your impressive initial gains show that skipping is indeed a big cost for
certain queries.

Lucene's multi-level skip list implementation is quite old now (I think
more than a decade?) and I think rather complex.  There is a fun paper
describing a simple and probabilistic approach to inline the (still
multi-level) skip data, linked from this long-ago issue:
https://issues.apache.org/jira/browse/LUCENE-2962  That would be a way to
keep the sequential (slow-disk-friendly) access and perhaps reduce the cost
versus Lucene's skipping implementation today.

But I do love the simplicity of a flat array of the bottom-most skip data.
Having that as an optional (not default) Codec would be a nice option.
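
To make that concrete, here's a rough sketch of what such a flat,
binary-searched structure might look like (all names are made up for
illustration, the 128-doc block size is just the granularity discussed in
this thread, and this is not Lucene's actual postings API):

    import java.util.Arrays;

    /**
     * Toy sketch: one entry per block of 128 docs, holding the last docID in
     * the block and the file pointer to the start of that block's postings.
     * advance(target) binary searches the docIDs instead of walking
     * multi-level skip pointers.
     */
    public class FlatSkipData {
      private final int[] lastDocIDs;    // last docID of each block, ascending
      private final long[] blockOffsets; // file pointer to the start of each block

      public FlatSkipData(int[] lastDocIDs, long[] blockOffsets) {
        this.lastDocIDs = lastDocIDs;
        this.blockOffsets = blockOffsets;
      }

      /** Returns the offset of the first block whose last docID is >= target,
       *  or -1 if the target is beyond the last block. */
      public long advance(int target) {
        int idx = Arrays.binarySearch(lastDocIDs, target);
        if (idx < 0) {
          idx = -idx - 1; // insertion point: first block that could contain target
        }
        return idx < blockOffsets.length ? blockOffsets[idx] : -1;
      }
    }

Note that binary searching needs random access into the entries, which ties
into the reduced delta-encoding and index-size growth Greg mentioned below.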

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 27, 2021 at 7:42 PM Greg Miller <gs...@gmail.com> wrote:

> Thanks for the reply, Adrien! It makes sense to ensure the default
> codec continues to scale for large indexes that can't fit in a
> machine's physical memory. Thanks for the thoughts and context on the
> terms/points indexes, etc.
>
> I'll look into how this idea could be spun off into a separate
> lucene/codecs implementation and open a Jira issue to track that work
> a little later this week. I'll also spend a little time thinking about
> whether or not there might be some sort of hybrid solution that
> enables some of these gains while maintaining sequential access. I
> don't have anything off the top of my head, but maybe if I put a draft
> of my change up as a PR (under lucene/codecs) it will spark some other
> ideas!
>
> Thanks again for your thoughts!
>
> Cheers,
> -Greg
>
> On Tue, Apr 27, 2021 at 2:23 PM Adrien Grand <jp...@gmail.com> wrote:
> >
> > Hi Greg,
> >
> > I like that Lucene can scale to index sizes that are much larger than
> the amount of main memory, so I would like the default codec to keep
> optimizing for sequential reads. We do random access for some parts of the
> index like the terms index and the points index, but the expectation is
> that they are so small that they would always fit in the page cache (they
> were actually in the JVM heap not long ago). A skip index for every postings
> list feels like something that could be much larger than the terms index, so
> I don't think it would be a good fit for the default codec.
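
(Just to put rough, made-up numbers on that size argument: with one
bottom-level entry per block of 128 docs, an index with 1 billion total
postings in a field would need on the order of 1,000,000,000 / 128 ≈ 8
million entries; at roughly 12 bytes each, say an int docID plus a long
file pointer stored uncompressed, that's on the order of 90 MB for that one
field, whereas the terms index grows with the number of terms rather than
the number of postings.)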
> >
> > We could still implement your idea in lucene/codecs for users who can
> force-load their index into memory; the speedup looks significant for
> queries that do lots of skipping!
> >
> > These results make me wonder if there are other ways we could get
> similar benefits while keeping a sequential access pattern when reading
> postings?
> >
> > On Tue, Apr 27, 2021 at 9:03 PM Greg Miller <gs...@gmail.com> wrote:
> >>
> >> Hey folks-
> >>
> >> I've been experimenting with a different approach to implementing skip
> >> lists in Lucene, and would be curious if anyone else has tried
> >> something similar, or has early thoughts/feedback on what I'm doing.
> >>
> >> The idea is generally a simplification of what Lucene currently does,
> >> targeted at improving QPS at search-time. Instead of treating skip
> >> lists as forward-iterating structures, I'm indexing a single, flat
> >> structure that I binary search when advancing. I'm indexing the same
> >> data present in the lowest level of the current structures (i.e., skip
> >> pointers to each block of 128 docs), and then binary searching those
> >> pointers to find the candidate block for the target docID (instead of
> >> only seeking forward).
> >>
> >> Before describing this idea in more detail (or opening a Jira), I'd be
> >> curious how this community thinks about disk access operations and
> >> what platforms/use-cases we primarily optimize for these days. This
> >> approach I've been experimenting with is essentially a non-starter if
> >> we assume that index accesses generally involve actually going to
> >> disk—especially if those disks spin. Random seeks all over the place
> >> are probably a terrible idea if those are actually hitting disk. In
> >> the cases I've been experimenting with, indexes are generally hot, so
> >> random seeks aren't much of a concern. Do we tend to optimize with
> >> this assumption in mind, or are we still pretty careful about
> >> use-cases that are actually doing a lot of disk IO?
> >>
> >> There are some other tricky implications with the approach I'm
> >> experimenting with (some edge cases around Impacts and index size
> >> growth due to not being able to do as much delta-encoding), but it's
> >> not worth going into those until addressing the main question of
> >> whether random seeks even make sense.
> >>
> >> I've done some early benchmarking with luceneutil (wikimedium10m
> >> specifically) and the idea looks like it might have some promise. I
> >> don't really see any regressions to QPS*, while a few tasks seem to
> >> show significant QPS improvements (AndHighLow: 9.8%, OrNotHighLow:
> >> 11.1%, OrHighLow: 12.3%).
> >> * (note: I've disabled Impacts in both baseline/candidate for early
> >> experiments because of some challenges related to them... so these
> >> results have a major asterisk right now and more work would certainly
> >> need to be done)
> >>
> >> Thanks in advance for any feedback! I'm completely open to shooting
> >> down this idea early if there are some fundamental flaws, or
> >> alternatively opening up a Jira to investigate this further if folks
> >> think it's reasonable to explore more :)
> >>
> >> Cheers,
> >> -Greg
> >>
>

Re: Exploring a different approach to skip lists

Posted by Greg Miller <gs...@gmail.com>.
Thanks Mike! I've started reading that paper, and it's interesting
indeed! I don't have much time to explore this further at the moment,
but I plan to revisit it when I do. It'd be fun to explore the idea in
the paper to see if a "best of both worlds" approach is possible (the
speedups of binary searching while maintaining the forward-iteration
paradigm)! It would be interesting to compare that implementation to
the simple, flat binary-searched list as well.
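
Just to sketch one possible direction for such a hybrid (purely
hypothetical; I haven't implemented or benchmarked anything like this):
exponential, "galloping" search over the flat block entries keeps the
access pattern moving forward from the current block while still getting
most of the binary-search speedup. Something like:

    import java.util.Arrays;

    /**
     * Toy sketch of a forward-only "galloping" advance over the flat block
     * index: double the step until we pass the target, then binary search
     * only the last window. Names are made up for illustration.
     */
    public class GallopingSkip {
      /** Returns the index of the first block (at or after fromBlock) whose
       *  last docID is >= target, or lastDocIDs.length if there is none. */
      static int advanceBlock(int[] lastDocIDs, int fromBlock, int target) {
        if (fromBlock >= lastDocIDs.length) {
          return lastDocIDs.length;
        }
        int bound = 1;
        // Gallop forward, doubling the step, until we overshoot the target.
        while (fromBlock + bound < lastDocIDs.length
            && lastDocIDs[fromBlock + bound] < target) {
          bound *= 2;
        }
        // Binary search within the last galloped window only.
        int lo = fromBlock + bound / 2;
        int hi = Math.min(fromBlock + bound, lastDocIDs.length - 1);
        int idx = Arrays.binarySearch(lastDocIDs, lo, hi + 1, target);
        if (idx < 0) {
          idx = -idx - 1; // insertion point
        }
        return idx;
      }
    }

That would keep reads moving strictly forward through the bottom-level
entries while doing far fewer comparisons than a linear scan when the skip
distance is large.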

Cheers,
-Greg

On Mon, May 3, 2021 at 10:31 AM Michael McCandless
<lu...@mikemccandless.com> wrote:
>
> +1 to explore innovations in Lucene's skip-list implementation!
>
> Your impressive initial gains show that skipping is indeed a big cost for certain queries.
>
> Lucene's multi-level skip list implementation is quite old now (I think more than a decade?) and I think rather complex.  There is a fun paper describing a simple and probabilistic approach to inline the (still multi-level) skip data, linked from this long-ago issue: https://issues.apache.org/jira/browse/LUCENE-2962  That would be a way to keep the sequential (slow-disk-friendly) access and perhaps reduce the cost versus Lucene's skipping implementation today.
>
> But I do love the simplicity of a flat array of the bottom-most skip data.  Having that as an optional (not default) Codec would be a nice option.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Apr 27, 2021 at 7:42 PM Greg Miller <gs...@gmail.com> wrote:
>>
>> Thanks for the reply, Adrien! It makes sense to ensure the default
>> codec continues to scale for large indexes that can't fit in a
>> machine's physical memory. Thanks for the thoughts and context on the
>> terms/points indexes, etc.
>>
>> I'll look into how this idea could be spun off into a separate
>> lucene/codecs implementation and open a Jira issue to track that work
>> a little later this week. I'll also spend a little time thinking about
>> whether or not there might be some sort of hybrid solution that
>> enables some of these gains while maintaining sequential access. I
>> don't have anything off the top of my head, but maybe if I put a draft
>> of my change up as a PR (under lucene/codecs) it will spark some other
>> ideas!
>>
>> Thanks again for your thoughts!
>>
>> Cheers,
>> -Greg
>>
>> On Tue, Apr 27, 2021 at 2:23 PM Adrien Grand <jp...@gmail.com> wrote:
>> >
>> > Hi Greg,
>> >
>> > I like that Lucene can scale to index sizes that are much larger than the amount of main memory, so I would like the default codec to keep optimizing for sequential reads. We do random access for some parts of the index like the terms index and the points index, but the expectation is that they are so small that they would always fit in the page cache (they were actually in the JVM heap not long ago). A skip index for every postings list feels like something that could be much larger than the terms index, so I don't think it would be a good fit for the default codec.
>> >
>> > We could still implement your idea in lucene/codecs for users who can force-load their index into memory; the speedup looks significant for queries that do lots of skipping!
>> >
>> > These results make me wonder if there are other ways we could get similar benefits while keeping a sequential access pattern when reading postings?
>> >
>> > On Tue, Apr 27, 2021 at 9:03 PM Greg Miller <gs...@gmail.com> wrote:
>> >>
>> >> Hey folks-
>> >>
>> >> I've been experimenting with a different approach to implementing skip
>> >> lists in Lucene, and would be curious if anyone else has tried
>> >> something similar, or has early thoughts/feedback on what I'm doing.
>> >>
>> >> The idea is generally a simplification of what Lucene currently does,
>> >> targeted at improving QPS at search-time. Instead of treating skip
>> >> lists as forward-iterating structures, I'm indexing a single, flat
>> >> structure that I binary search when advancing. I'm indexing the same
>> >> data present in the lowest level of the current structures (i.e., skip
>> >> pointers to each block of 128 docs), and then binary searching those
>> >> pointers to find the candidate block for the target docID (instead of
>> >> only seeking forward).
>> >>
>> >> Before describing this idea in more detail (or opening a Jira), I'd be
>> >> curious how this community thinks about disk access operations and
>> >> what platforms/use-cases we primarily optimize for these days. This
>> >> approach I've been experimenting with is essentially a non-starter if
>> >> we assume that index accesses generally involve actually going to
>> >> disk—especially if those disks spin. Random seeks all over the place
>> >> are probably a terrible idea if those are actually hitting disk. In
>> >> the cases I've been experimenting with, indexes are generally hot, so
>> >> random seeks aren't much of a concern. Do we tend to optimize with
>> >> this assumption in mind, or are we still pretty careful about
>> >> use-cases that are actually doing a lot of disk IO?
>> >>
>> >> There are some other tricky implications with the approach I'm
>> >> experimenting with (some edge cases around Impacts and index size
>> >> growth due to not being able to do as much delta-encoding), but it's
>> >> not worth going into those until addressing the main question of
>> >> whether random seeks even make sense.
>> >>
>> >> I've done some early benchmarking with luceneutil (wikimedium10m
>> >> specifically) and the idea looks like it might have some promise. I
>> >> don't really see any regressions to QPS*, while a few tasks seem to
>> >> show significant QPS improvements (AndHighLow: 9.8%, OrNotHighLow:
>> >> 11.1%, OrHighLow: 12.3%).
>> >> * (note: I've disabled Impacts in both baseline/candidate for early
>> >> experiments because of some challenges related to them... so these
>> >> results have a major asterisk right now and more work would certainly
>> >> need to be done)
>> >>
>> >> Thanks in advance for any feedback! I'm completely open to shooting
>> >> down this idea early if there are some fundamental flaws, or
>> >> alternatively opening up a Jira to investigate this further if folks
>> >> think it's reasonable to explore more :)
>> >>
>> >> Cheers,
>> >> -Greg
>> >>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org