You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2015/07/27 16:06:05 UTC

[jira] [Commented] (LUCENE-6697) Use 1D KD tree for alternative to postings based numeric range filters

    [ https://issues.apache.org/jira/browse/LUCENE-6697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642753#comment-14642753 ] 

Michael McCandless commented on LUCENE-6697:
--------------------------------------------

For a benchmark I took the same ~60.8 M lat/lon points from
OpenStreetMaps benchmark, but just indexed and searched the lat
value.  Benchmark source code is here: https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/IndexAndSearchOpenStreetMaps1D.java

Using LongField/NumericRangeQuery (trunk):
  * 152 sec to index
  * 42 sec to optimize
  * 491 MB total index
  * 5.6 MB heap for terms index
  * 9.6 sec for 225 queries total 667,206,375 hits

Using 1D BKD tree (patch):
  * 290 sec to index
  * 256 sec to optimize
  * 442 MB total index (204 MB doc values, 233 MB BKD tree)
  * 5.2 sec for 225 queries total 667,206,375 hits
  * 1.0 MB heap used for KD index

BKD tree is quite a bit slower to index, especially merging since it fully
re-builds the binary tree, and it's using offline sorting, but then is
~46% faster to search, and uses less heap for the in-memory index
structure.

Index size is a bit smaller, but it is also storing doc values so
e.g. you could sort by this field as well vs. LongField which would
have to also index doc values to enable sorting = larger index size.


> Use 1D KD tree for alternative to postings based numeric range filters
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-6697
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6697
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-6697.patch
>
>
> Today Lucene uses postings to index a numeric value at multiple
> precision levels for fast range searching.  It's somewhat costly: each
> numeric value is indexed with multiple terms (4 terms by default)
> ... I think a dedicated 1D BKD tree should be more compact and perform
> better.
> It should also easily generalize beyond 64 bits to arbitrary byte[],
> e.g. for LUCENE-5596, but I haven't explored that here.
> A 1D BKD tree just sorts all values, and then indexes adjacent leaf
> blocks of size 512-1024 (by default) values per block, and their
> docIDs, into a fully balanced binary tree.  Building the range filter
> is then just a recursive walk through this tree.
> It's the same structure we use for 2D lat/lon BKD tree, just with 1D
> instead.  I implemented it as a DocValuesFormat that also writes the
> numeric tree on the side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org