Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2016/04/26 08:16:12 UTC

[jira] [Updated] (LUCENE-7254) DocIDSetBuilder is no good for points

     [ https://issues.apache.org/jira/browse/LUCENE-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-7254:
--------------------------------
    Attachment: LUCENE-7254.patch

Here is a patch with {{MatchingPoints}}. It tries to use all the stats we have for points to leave less performance on the table.

We can try to make it fancier later, but for now it:
1) decides up front between a sparse and a dense representation, based on whether the field is sparse
2) computes cost/cardinality as 'counter' if the field is single-valued (which is exact); in the multi-valued case it multiplies counter by 'docs per point' from field stats.
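
To make the two steps above concrete, here is a minimal sketch. It is not the code from the attached patch. The class name {{MatchingPointsSketch}}, the 1% sparsity threshold, and the use of {{java.util.BitSet}} as a stand-in for both the sparse and dense bitsets are all assumptions for illustration; {{docCount}} and {{numPoints}} are assumed to come from the field's point statistics.

```java
import java.util.BitSet;

/** Hypothetical illustration only; not the MatchingPoints class from the patch. */
final class MatchingPointsSketch {
    // Assumed heuristic: the field is "sparse" if fewer than 1 in 100 docs have a value.
    private static final int SPARSE_RATIO = 100;

    private final BitSet bits;      // stand-in for a sparse or dense bitset
    private final boolean sparse;   // decided once, in the constructor
    private final long docCount;    // docs with at least one point for the field
    private final long numPoints;   // total points indexed for the field
    private long counter;           // number of add() calls so far

    MatchingPointsSketch(int maxDoc, long docCount, long numPoints) {
        this.docCount = docCount;
        this.numPoints = numPoints;
        // Step 1: choose the representation up front, so add() never has to switch.
        this.sparse = docCount * SPARSE_RATIO < maxDoc;
        this.bits = new BitSet(maxDoc);
    }

    /** No growth checks and no representation switching: set the bit and count. */
    void add(int doc) {
        bits.set(doc);
        counter++;
    }

    boolean isSparse() {
        return sparse;
    }

    /** Step 2: estimate cost from the counter and field stats, without scanning bits. */
    long cost() {
        if (numPoints == docCount) {
            return counter; // single-valued field: one point per doc, so this is exact
        }
        // Multi-valued field: scale by the average number of docs per point.
        return (long) (counter * (docCount / (double) numPoints));
    }
}
```

In the multi-valued case the result is only an estimate, since several matching points may belong to the same document.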

I see the following results in the geo benchmark:
{noformat}
boxquery (this is a 2-D PointRangeQuery): 63.4 QPS -> 85.2 QPS
distance query: 37.2 QPS -> 46.2 QPS
polygon query (n=5): 49.0 QPS -> 61.3 QPS
{noformat}


> DocIDSetBuilder is no good for points
> -------------------------------------
>
>                 Key: LUCENE-7254
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7254
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-7254.patch
>
>
> For the postings lists, I think this approach works well in dense cases (e.g. whole DISIs are added, things are coming in order, etc).
> However in the points case, it holds back range performance significantly. There are a couple of problems here:
> * expensive cardinality computation (this is a 2% hit) when it's totally unnecessary. We can use index statistics to help here.
> * lots of conditional stuff in add(). This includes growing checks, bitset switching checks and so on. These happen even if you are smart and call grow, and this stuff all adds up.
> I don't think we should try to create a magical shared API that is efficient both for postings lists of unstructured stuff and for point collection for structured fields. Instead, we should just do things differently for points and iterate from there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
