Posted to dev@lucene.apache.org by Daniel Penning <d....@stroeermediabrands.de> on 2019/02/11 12:42:43 UTC

Improving DocValue sorting performance

Hi

I think sorting and searching on docvalues can benefit from optimizations
similar to those that impacts provide when scoring is used. By storing the
minimum and maximum value for blocks of docvalues, it would be possible to
skip over sets of documents that don't contain any values in the relevant
range.
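
Roughly what I have in mind, as a minimal sketch in Java (all names here
are invented for illustration, nothing from the Lucene codebase):

    // Sketch only: per-block metadata recording the smallest and largest
    // docvalue in that block.
    final class BlockMinMax {
      final int firstDoc;   // first docID covered by this block
      final int lastDoc;    // last docID covered by this block
      final long minValue;  // smallest value stored in the block
      final long maxValue;  // largest value stored in the block

      BlockMinMax(int firstDoc, int lastDoc, long minValue, long maxValue) {
        this.firstDoc = firstDoc;
        this.lastDoc = lastDoc;
        this.minValue = minValue;
        this.maxValue = maxValue;
      }

      // The whole block can be skipped when its value range cannot
      // intersect the query range [lo, hi].
      boolean canSkip(long lo, long hi) {
        return maxValue < lo || minValue > hi;
      }
    }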

When sorting a large result set, this could be used to skip over parts of
the index once the desired number of documents has been found, narrowing
the candidate range down further step by step. It could also improve the
performance of docvalue-based range queries used in conjunctions by only
looking at blocks of documents that actually contain values in the correct
range.
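
For a range query over [lo, hi], the skipping could look like this (again
just a sketch, building on the invented BlockMinMax class above; for
simplicity it assumes every document in a block has a value, which real
docvalues don't guarantee):

    import java.util.List;
    import java.util.function.IntConsumer;
    import java.util.function.LongUnaryOperator;

    final class BlockSkippingRangeFilter {
      // Blocks whose [min, max] cannot intersect [lo, hi] are skipped
      // wholesale; only the remaining blocks are scanned per document.
      static void collectInRange(List<BlockMinMax> blocks, long lo, long hi,
                                 LongUnaryOperator valueForDoc,
                                 IntConsumer collector) {
        for (BlockMinMax block : blocks) {
          if (block.canSkip(lo, hi)) {
            continue; // no document in this block can match
          }
          for (int doc = block.firstDoc; doc <= block.lastDoc; doc++) {
            long value = valueForDoc.applyAsLong(doc);
            if (value >= lo && value <= hi) {
              collector.accept(doc);
            }
          }
        }
      }
    }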

Currently this is just an idea I had while looking at the impacts
implementation, and I wanted to get your opinion before I spend time
building a proof-of-concept implementation.

-- 
------------------------------

Daniel Penning / Senior Product Developer
T.: +49 (0)30 5900113-83 / F.: +49 (0)30 5900113-99
E-Mail: d.penning@stroeermediabrands.de
Web: www.stroeermediabrands.de

Ströer Media Brands AG, Torstraße 49, 10119 Berlin-Mitte
Vorstand: Marc Schmitz
Handelsregister: Amtsgericht Berlin-Charlottenburg HRB 126603 B

Re: Improving DocValue sorting performance

Posted by Alan Woodward <ro...@gmail.com>.
+1 This would be a very nice addition, and Toke’s recent work adding jump tables to docvalues would provide a natural place to store the information.



Re: Improving DocValue sorting performance

Posted by Toke Eskildsen <to...@kb.dk>.
On Mon, 2019-02-11 at 13:42 +0100, Daniel Penning wrote:
> I think sorting and searching on docvalues can benefit from
> optimizations similar to those that impacts provide when scoring is
> used. By storing the minimum and maximum value for blocks of
> docvalues, it would be possible to skip over sets of documents that
> don't contain any values in the relevant range.

Nice idea: The docID-indexed blocks are handled by IndexedDISI, where each
block represents 65536 docs. It would be simple enough to enrich these
with min & max, although it would mean 16 extra bytes for each block,
where the current minimum is 0 bytes plus 8 bytes for the jump-table
entry. It could be made into an option, of course... As Alan suggests, the
jump tables would hold the values for maximum skippiness.
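
As a back-of-the-envelope sketch of where the extra bytes would go
(made-up names, not the real IndexedDISI writer):

    import java.io.IOException;
    import org.apache.lucene.store.DataOutput;

    final class JumpTableWriterSketch {
      // Appending min & max costs two extra longs (16 bytes) per
      // 65536-doc block, on top of the existing 8-byte entry.
      static void writeEntry(DataOutput out, long existingEntry,
                             long minValue, long maxValue)
          throws IOException {
        out.writeLong(existingEntry); // the current 8-byte jump-table entry
        out.writeLong(minValue);      // proposed extension
        out.writeLong(maxValue);      // proposed extension
      }
    }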

I'm unsure how much API change it would require for the values to be used.
Adding a method such as nextDocIDWhereNumericMightBeHigherThan() seems
awfully specialized and might require knowledge of the block size used
inside IndexedDISI, which is implementation-specific. I do like the
general idea of "can we somehow skip ahead?".
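
To make the concern concrete, the specialized method would look roughly
like this (sketch only; apart from the method name above, everything here
is invented and nothing like it exists in Lucene):

    import java.io.IOException;
    import org.apache.lucene.index.NumericDocValues;

    abstract class SkippingNumericDocValues extends NumericDocValues {
      /**
       * Returns the first docID at or beyond target whose enclosing block
       * has a maximum value greater than threshold, or NO_MORE_DOCS if no
       * such block remains. Blocks whose max is too low are skipped
       * without inspecting individual documents.
       */
      abstract int nextDocIDWhereNumericMightBeHigherThan(int target,
          long threshold) throws IOException;
    }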

Another sore point is how much this would help statistically. What are the
chances that a block can be skipped at all? Of course it depends on the
distribution of the values themselves, so sorting descending on a
long-tail distribution might work very well, whereas a more uniform
distribution would result in no speed-up. A date field in a time-series
corpus, indexed in chronological order, would work extremely well when
sorting ascending and extremely poorly when sorting descending (possibly
it's the other way around; it's getting late here).


So yeah, it is definitely technically possible and a very interesting
idea. And maybe very specialized?

Cc to Daniel, as the thread is a month old.

- Toke Eskildsen, Royal Danish Library



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org