You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Nicholas Knize (JIRA)" <ji...@apache.org> on 2018/09/12 22:40:00 UTC
[jira] [Commented] (LUCENE-8496) Explore selective dimension indexing in BKDReader/Writer

    [ https://issues.apache.org/jira/browse/LUCENE-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612826#comment-16612826 ] 

Nicholas Knize commented on LUCENE-8496:
----------------------------------------

Initial patch provided:

The lionshare of the changes are made to {{FieldType}}, {{BKDWriter}}, and {{BKDReader}}.

* {{FieldType}} - split {{pointDimensionCount}} into two new integers that define {{pointDataDimensionCount}} and {{pointIndexDimensionCount}}. {{pointIndexDimensionCount}} must be <= {{pointDataDimensionCount}} and defines the first {{n}} dimensions that will be used to build the index. The remaining {{pointDataDimensionCount}} - {{pointIndexDimensionCount}} dimensions are ignored while building (e.g., split/merge) the index. Getter and Setter utility methods are added.

* {{BKDWriter}} - change {{writeIndex}} to encode and write {{numIndexDims}} in the 2 most significant bytes of the integer that formerly stored {{numDims}} this provides simple backwards compatability without requiring a change to {{FieldInfoFormat}}. Indexing methods are updated to only use the first {{numIndexDims}} while building the tree. Leaf nodes still use {{numDataDims}} for efficiently packing and compressing the leaf level data (data nodes).

* {{BKDReader}} - add version checking in the constructor to decode {{numIndexDims}} and {{numDataDims}} from the packed dimension integer. Update index reading methods to only look at the first {{numIndexDims}} while traversing the tree. {{numDataDims}} are still used for decoding leaf level data.

* API Changes - all instances of {{pointDimensionCount}} have been updated to {{pointDataDimensionCount}} and {{pointIndexDimensionCount}} to reflect total number of dimensions, and number of dimensions used for creating the index, respectively.

* All POINT Tests and POINT based Fields have been updated to use the API changes.

Benchmarking
---

To benchmark the changes I update {{LatLonShape}} (not included in this patch) and ran benchmark tests both with and without selective indexing. The results are below: 

6 dimension encoded {{LatLonShape}} w/o selective indexing
------
INDEX SIZE: 1.2795778876170516 GB
READER MB: 1.7928361892700195
BEST M hits/sec: 11.67378231920028
BEST QPS: 6.8635445274291715 for 225 queries, totHits=382688713

7 dimension LatLonShape encoding w/ 4 dimension selective indexing
-------
INDEX SIZE: 2.1509012933820486 GB
READER MB: 1.8154268264770508
BEST M hits/sec: 17.018094815004627
BEST QPS: 10.005707519719927 for 225 queries, totHits=382688713

The gains are a little better than the differences between searching a 4d range vs a 6d range. The index size increased due to using 7 dimensions instead of 6, but I also switched over to a bit bigger encoding size.

> Explore selective dimension indexing in BKDReader/Writer
> --------------------------------------------------------
>
>                 Key: LUCENE-8496
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8496
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Nicholas Knize
>            Priority: Major
>         Attachments: LUCENE-8496.patch
>
>
> This issue explores adding a new feature to BKDReader/Writer that enables users to select a fewer number of dimensions to be used for creating the BKD index than the total number of dimensions specified for field encoding. This is useful for encoding dimensional data that is used for interpreting the encoded field data but unnecessary (or not efficient) for creating the index structure. One such example is {{LatLonShape}} encoding. The first 4 dimensions may be used to to efficiently search/index the triangle using its precomputed bounding box as a 4D point, and the remaining dimensions can be used to encode the vertices of the tessellated triangle. This causes BKD to act much like an R-Tree for shape data where search is distilled into a 4D point (instead of a more expensive 6D point) and the triangle is encoded using a portion of the remaining (non-indexed) dimensions. Fields that use the full data range for indexing are not impacted and behave as they normally would.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org