Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2019/12/11 09:58:00 UTC

[jira] [Commented] (LUCENE-9087) Should the BKD tree use a fixed maxPointsInLeafNode?

    [ https://issues.apache.org/jira/browse/LUCENE-9087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993378#comment-16993378 ] 

Adrien Grand commented on LUCENE-9087:
--------------------------------------

I wouldn't worry about the transient write-time memory usage. We need to record the split value (let's say 8 bytes) for each inner node and the file pointer (8 bytes) for each leaf node. Assuming a single-valued field and 128 points per leaf (4x less than today), this gives maxDoc/128*(8+8) = maxDoc/8 bytes, which is the same memory usage as a bitset. We already allocate bitsets of size maxDoc for the handling of deleted documents, so I don't think allocating similar amounts of memory for points would be an issue.
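
As a back-of-the-envelope check of that estimate, here is a hypothetical helper (not Lucene code) that only assumes one ~8-byte split value per inner node and one 8-byte file pointer per leaf:

    // Hypothetical helper: estimates the transient write-time heap needed for
    // split values (inner nodes) and file pointers (leaf nodes), assuming a
    // single-valued field so the number of points is roughly maxDoc.
    static long estimateWriteTimeBytes(long maxDoc, int pointsPerLeaf) {
      long numLeaves = (maxDoc + pointsPerLeaf - 1) / pointsPerLeaf;
      long numInnerNodes = numLeaves - 1; // a full binary tree has leaves-1 inner nodes
      return numInnerNodes * 8L + numLeaves * 8L;
    }

    // estimateWriteTimeBytes(100_000_000, 128) is about 12.5 MB, the same order
    // as the FixedBitSet(100_000_000) we already allocate to track deletes.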

 The read-time memory usage is a more interesting question to me, since it is permanent and adds up across fields and segments. One nice property of the BKD packed index is that it only seeks forward, so I wonder whether we might be able to load it off-heap all the time, not only for MMapDirectory. If the BKD index were always off-heap, memory usage would no longer be a concern and we could feel more free to reduce the number of points per leaf node to improve performance.
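
A minimal sketch of what "always off-heap" could look like, assuming the packed index is consumed sequentially from an IndexInput rather than copied into an on-heap byte[] (the class and method names below are illustrative, not the actual BKDReader internals):

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    // Illustrative forward-only view over the packed index bytes. Because
    // traversal never seeks backwards, the index could be read straight from
    // the (possibly memory-mapped) file instead of from an on-heap copy.
    final class ForwardPackedIndexReader {
      private final IndexInput in; // positioned at the start of the packed index

      ForwardPackedIndexReader(IndexInput in) {
        this.in = in;
      }

      long nextLeafFilePointerDelta() throws IOException {
        return in.readVLong(); // only ever moves the file pointer forward
      }

      int nextSplitDim() throws IOException {
        return in.readByte() & 0xff;
      }
    }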

> Should the BKD tree use a fixed maxPointsInLeafNode? 
> -----------------------------------------------------
>
>                 Key: LUCENE-9087
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9087
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ignacio Vera
>            Priority: Major
>
> Currently the BKD tree uses a fixed maxPointsInLeafNode provided in the constructor. For the current default codec the value is set to 1200. This is a good compromise between memory usage and performance of the BKD tree.
> Lowering this value can increase search performance, but it has a penalty in memory usage. Now that the BKD tree can be loaded off-heap, this is less of a concern. Note that lowering that value too much can hurt performance as well, as the tree becomes too deep and the benefits are gone.
> For data types that use the tree as an effective R-tree (the ranges and shapes data types) the benefits are larger, as a lower value can minimise the overlap between leaf nodes.
> Finally, creating too many leaf nodes can be dangerous at write time, as memory usage grows with the number of leaf nodes created: the writer allocates a long array of length numberOfLeafNodes.
> What I am wondering here is whether we can improve this situation in order to create the most efficient tree. My current ideas are:
>  
>  * We can adapt the points per leaf to the number of points to be indexed, so that we create a tree with the best depth and best points per leaf; note that for the 1D case we have an upper estimate of the number of points that the tree will be indexing (see the sketch after this list).
>  * Add a mechanism so field types can easily define their best points per leaf. In that case, field types like ranges or shapes can define their own value to minimise overlap.
>  * Maybe the default is just too high now that we can load the tree off-heap.
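>
> As a sketch of the first idea (purely illustrative; the bounds and target leaf count below are made-up values, not anything Lucene uses):
>
>     // Hypothetical heuristic: derive the points per leaf from an upper
>     // estimate of the point count, so that the tree gets a reasonable depth
>     // and leaf size rather than a fixed 1200 everywhere.
>     static int choosePointsPerLeaf(long estimatedPointCount) {
>       int minPerLeaf = 128, maxPerLeaf = 1200; // assumed bounds
>       long targetLeaves = 64 * 1024;           // assumed target leaf count
>       long perLeaf = (estimatedPointCount + targetLeaves - 1) / targetLeaves;
>       return (int) Math.max(minPerLeaf, Math.min(maxPerLeaf, perLeaf));
>     }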
>  
> Any thoughts?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org