You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Ignacio Vera (JIRA)" <ji...@apache.org> on 2018/09/05 11:08:00 UTC

[jira] [Commented] (LUCENE-7862) Should BKD cells store their min/max packed values?

    [ https://issues.apache.org/jira/browse/LUCENE-7862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604279#comment-16604279 ] 

Ignacio Vera commented on LUCENE-7862:
--------------------------------------

I have been playing with this approach to see if we can make the BKD tree more efficient. I have done some benchmarks that confirm a big boost in performance whenever there is correlation between the dimensions so it is particularly good for ranges. In addition this boost quite a bit the newly introduced approach for indexing shapes in LUCENE-8396.

Maybe we should only ad this extra information to the index when number of dimensions > 1. Here are a few benchmarks:

1) Ranges: Using the data points from GeoBench, the ranges have been created as following:
{code:java}
double lat = Double.parseDouble(parts[1]);
double lon = Double.parseDouble(parts[2]);
double latEnd = lat + 0.1 * Math.abs(lon);
double lonEnd = lon + 0.1 * Math.abs(lat);{code}
Where latitude is used in even dimension and longitude in uneven ones.

IT = Index time (sec)
 FM = Force merge time (sec)
 IS = Index size (GB)
 RH = Reader heap (MB
|| Approach|| Dimensions||IT Dev||IT Base||IT Diff||IFM Dev||FM Base||FM Diff||IS Dev||IS Base||IS Diff||RH Dev||RH Base||IRH Diff||
|ranges|1|102.7s|101.5s|1%|0.0s|0.0s|0%|0.81|0.81|0%|0.80|0.80|-0%|
|ranges|2|96.0s|94.4s|2%|0.0s|0.0s|0%|1.59|1.58|0%|1.02|1.03|-1%|
|ranges|3|93.1s|84.6s|10%|0.0s|0.0s|0%|2.29|2.29|0%|1.18|1.01|17%|
|ranges|4|101.0s|91.0s|11%|0.0s|0.0s|0%|3.07|3.05|0%|1.04|1.12|-7%|

HS = Hits per second
 QPS= Queries per second
 Hits = Total hits
||Approach|| Dimensions||HS Dev||HS Base||HS Diff||QPS Dev||QPS Base||QPS Diff||Hits Dev||Hits Base||Hits Diff||
|ranges|1|131.17|116.20|13%|10.81|9.58|13%|2730282630|2730282630|0%|
|ranges|2|34.07|10.83|215%|16.93|5.38|215%|452860046|452860046|0%|
|ranges|3|34.41|3.83|799%|17.10|1.90|799%|452860046|452860046|0%|
|ranges|4|26.01|3.27|696%|12.92|1.62|696%|452860046|452860046|0%|.  |

 

3) GeoBench: comparison of Geo benchmark using points (LatLonPoint), geo3d (Geo3DPoint) and shapes (LaLonShape)

IT = Index time (sec)
 FM = Force merge time (sec)
 IS = Index size (GB)
 RH = Reader heap (MB
|| Approach||IT Dev||IT Base||IT Diff||IFM Dev||FM Base||FM Diff||IS Dev||IS Base||IS Diff||RH Dev||RH Base||IRH Diff||
|points|103.3s|100.0s|3%|0.0s|0.0s|0%|0.51|0.51|0%|0.61|0.61|0%|
|geo3d|105.6s|102.9s|3%|0.0s|0.0s|0%|0.72|0.72|0%|0.62|0.62|-0%|
|shapes|108.3s|100.6s|8%|0.0s|0.0s|0%|1.31|1.30|0%|0.87|0.87|-0%|

HS = Hits per second
 QPS= Queries per second
 Hits = Total hits
||Approach|| Shape||HS Dev||HS Base||HS Diff||QPS Dev||QPS Base||QPS Diff||Hits Dev||Hits Base||Hits Diff||
|points|polyRussia|17.19|17.19|0%|4.90|4.90|0%|3508846|3508846|0%|
|points|polyMedium|9.51|9.26|3%|116.51|113.40|3%|2693559|2693559|0%|
|points|distance|77.56|77.33|0%|45.57|45.43|0%|382961957|382961957|0%|
|points|nearest 10|0.05|0.05|-0%|4651.95|4664.57|-0%|60844404|60844404|0%|
|points|box|81.95|83.07|-1%|83.39|84.53|-1%|221118844|221118844|0%|
|points|poly 10|80.37|79.78|1%|50.83|50.45|1%|355809475|355809475|0%|
|points|sort|33.25|31.26|6%|33.83|31.81|6%|221118844|221118844|0%|
|geo3d|polyRussia|0.51|0.50|2%|0.15|0.14|2%|3508671|3508671|0%|
|geo3d|polyMedium|0.42|0.43|-0%|5.20|5.23|-0%|2693545|2693545|0%|
|geo3d|distance|64.36|63.47|1%|37.77|37.25|1%|383371884|383371884|0%|
|geo3d|box|49.29|51.60|-4%|50.15|52.50|-4%|221118844|221118844|0%|
|geo3d|poly 10|39.40|39.43|-0%|24.91|24.93|-0%|355855227|355855227|0%|
|shapes|polyRussia|2.81|0.31|819%|0.80|0.09|819%|3508846|3508846|0%|
|shapes|polyMedium|0.52|0.08|539%|6.38|1.00|539%|2693559|2693559|0%|
|shapes|box|11.07|1.46|661%|11.27|1.48|661%|221118844|221118844|0%|
|shapes|poly 10|16.83|1.46|1052%|10.64|0.92|1052%|355809475|355809475|0%|

 

> Should BKD cells store their min/max packed values?
> ---------------------------------------------------
>
>                 Key: LUCENE-7862
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7862
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7862.patch, LUCENE-7862.patch
>
>
> The index of the BKD tree already allows to know lower and upper bounds of values in a given dimension. However the actual range of values might be more narrow than what the index tells us, especially if splitting on one dimension reduces the range of values in at least one other dimension. For instance this tends to be the case with range fields: since we enforce that lower bounds are less than upper bounds, splitting on one dimension will also affect the range of values in the other dimension.
> So I'm wondering whether we should store the actual range of values for each dimension in leaf blocks, this will hopefully allow to figure out that either none or all values match in a block without having to check them all.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org