
Posted to dev@lucene.apache.org by "Ignacio Vera (JIRA)" <ji...@apache.org> on 2019/07/04 09:19:00 UTC

[jira] [Resolved] (LUCENE-8888) Improve distribution of points with data dimension in BKD tree leaves

     [ https://issues.apache.org/jira/browse/LUCENE-8888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ignacio Vera resolved LUCENE-8888.
----------------------------------
       Resolution: Fixed
         Assignee: Ignacio Vera
    Fix Version/s: 8.2
                   master (9.0)

> Improve distribution of points with data dimension in BKD tree leaves
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-8888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8888
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ignacio Vera
>            Assignee: Ignacio Vera
>            Priority: Major
>             Fix For: master (9.0), 8.2
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> LUCENE-8688 introduced a new storage strategy for leaves that contain duplicated points. This works well with indexed dimensions, as the process of partitioning the space and the final sorting of leaves groups together points with equal indexed dimensions.
> This is not always the case if the points contain data dimensions. If two points have the same indexed dimensions but different data dimensions, their distribution on the leaves might not be optimal.
> A good example is a user indexing a bounding box using LatLonShape. The resulting tessellation of a bounding box is two triangles with the same indexed dimensions but different data dimensions. If two documents index the same bounding box, the result in the leaf is the triangles from the first document followed by the triangles of the second document. This is because the current sorting/selection algorithms use one indexed dimension and tie-break on the docID.
> The optimal distribution in the case above is to group together the equal triangles. Therefore, what is proposed here is to update the selection/sorting algorithms to use the data dimensions, when they exist, as tie-breakers before using the docID.
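
The tie-breaking change described above can be illustrated with a minimal sketch. This is not the Lucene implementation: the Point class, its fields, the two comparators, and the byte values are all hypothetical stand-ins, and the sketch compares the whole indexed part at once rather than a single split dimension. It assumes points are compared as unsigned byte arrays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class DataDimTieBreakSketch {

    /** Hypothetical stand-in for one point in a leaf; not a Lucene class. */
    static final class Point {
        final byte[] indexedDims; // dimensions used to partition the tree
        final byte[] dataDims;    // stored with the point, not used for partitioning
        final int docID;
        final String label;       // only used to print the resulting order

        Point(byte[] indexedDims, byte[] dataDims, int docID, String label) {
            this.indexedDims = indexedDims;
            this.dataDims = dataDims;
            this.docID = docID;
            this.label = label;
        }
    }

    // Order as described for the current behaviour: indexed dims, then docID.
    static final Comparator<Point> BY_INDEXED_THEN_DOC = (a, b) -> {
        int cmp = Arrays.compareUnsigned(a.indexedDims, b.indexedDims);
        return cmp != 0 ? cmp : Integer.compare(a.docID, b.docID);
    };

    // Proposed order: indexed dims, then data dims, then docID.
    static final Comparator<Point> BY_INDEXED_DATA_THEN_DOC = (a, b) -> {
        int cmp = Arrays.compareUnsigned(a.indexedDims, b.indexedDims);
        if (cmp != 0) {
            return cmp;
        }
        cmp = Arrays.compareUnsigned(a.dataDims, b.dataDims);
        return cmp != 0 ? cmp : Integer.compare(a.docID, b.docID);
    };

    // Two documents index the same bounding box; each yields triangles A and B
    // with identical indexed dims and different data dims (values are made up).
    static List<Point> sampleLeaf() {
        byte[] box = {1, 2};
        byte[] triangleA = {10};
        byte[] triangleB = {20};
        List<Point> leaf = new ArrayList<>();
        leaf.add(new Point(box, triangleA, 1, "doc1:A"));
        leaf.add(new Point(box, triangleA, 0, "doc0:A"));
        leaf.add(new Point(box, triangleB, 1, "doc1:B"));
        leaf.add(new Point(box, triangleB, 0, "doc0:B"));
        return leaf;
    }

    // Sorts the sample leaf with the given order and joins the labels.
    // List.sort is stable, so fully tied points keep their input order.
    static String sortedLabels(Comparator<Point> order) {
        List<Point> leaf = sampleLeaf();
        leaf.sort(order);
        StringBuilder sb = new StringBuilder();
        for (Point p : leaf) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(p.label);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: doc0:A doc0:B doc1:A doc1:B
        System.out.println(sortedLabels(BY_INDEXED_THEN_DOC));
        // prints: doc0:A doc1:A doc0:B doc1:B
        System.out.println(sortedLabels(BY_INDEXED_DATA_THEN_DOC));
    }
}
```

With the docID-only tie-break, the equal triangles from the two documents are interleaved; with the data dimensions as an earlier tie-breaker, they end up adjacent, which is the grouping the duplicate-aware leaf storage can take advantage of.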



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
