You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2016/12/23 10:30:58 UTC
[jira] [Updated] (LUCENE-7401) BKDWriter should ensure all dimensions are indexed

     [ https://issues.apache.org/jira/browse/LUCENE-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-7401:
---------------------------------
    Attachment: LUCENE-7401.patch

bq. That's a good example! In that case, with our current splitting, running a range filter for "small population" will be costly. Though, without other filters (by lat/lon) it will likely be costly anyway since town population is probably Zipf's law like? I.e., most areas will still have many more small population towns than big ones.

Right, a filter for "small population" is costly, but I don't think we can do much about it since it is due to the fact it would match lots of documents. The problem that would like to address here is the opposite: filtering for "large population", for instance: "Find all cities in America that have more than 10K inhabitants". This would be a selective filter, so hopefully something that we can execute efficiently. However with the current way the splitting works, the population dimension might never be indexed (because BKD would always decide to index the lat or lon instead, which have larger ranges) and such a query, which is very selective on the population dimension, would likely have to visit all documents that match "America" to find matches, which is disappointing.

Here is a patch that implements what I had in mind when opening this ticket. It ensures that every dimension gets split no less than 2x less than the dimension that had the most splits. So back to our 3 dimensions example with lat, lon and population, let's assume we have 1M docs, it would mean we need to split 10 times. In such a scenario, we would likely split 4 times on lat, 4 times on lon and 2 times on the population dimension.

I think it is a better trade-off since it has a better worst case for selective queries. For instance the fact that today the geo dimensions would get 10 splits means that a selective geo query would have to visit 1/1024th of the index, but a selective query on population would have to perform a linear scan. With this patch a selective geo query would have to visit 1/256th of the index (8 splits), which is slower, however a selective query on the population dimension would only have to visit 1/4th of the index (2 splits).

> BKDWriter should ensure all dimensions are indexed
> --------------------------------------------------
>
>                 Key: LUCENE-7401
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7401
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7401.patch
>
>
> The current heuristic is to use the dimension that has the largest span, so if dimensions have a different distribution of values, we could theoretically (but maybe in practice too?) end up with one dimension that is not indexed at all and queries that are mostly selective on this dimension would need to scan lots of blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org