Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2022/03/05 18:32:00 UTC
[jira] [Commented] (LUCENE-10425) count aggregation optimization inside one segment in log scenario

    [ https://issues.apache.org/jira/browse/LUCENE-10425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501790#comment-17501790 ] 

Adrien Grand commented on LUCENE-10425:
---------------------------------------

This would require a new API on PostingsEnum, so I'll write down what I think the applications of this change are, to make sure I understand the benefits correctly. If the index is sorted by a numeric field, Lucene could efficiently compute range facets (and special forms of range facets like histograms) for this numeric field, as long as there are no deletions and the query has a single term. For instance, an index containing logs and sorted by timestamp could very efficiently compute a histogram of the timestamp field given any term query. To use an example from a different use case, an index of an e-commerce catalog sorted by price could compute a histogram of prices very efficiently for any term query.
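
To make the idea concrete, here is a minimal sketch in plain Java. It does not use Lucene classes: the class SortedIndexHistogram, its methods, and the array-based stand-ins for the index (timestamps[docId] for the sort field, a sorted int[] for the postings of one term) are all assumptions made for illustration. It relies on the two conditions stated above: the index is sorted by the numeric field, and there are no deletions.

```java
/**
 * Hypothetical illustration only, not a Lucene API. Arrays stand in for
 * index structures: timestamps[docId] is the sort field value of each doc,
 * and postings holds the ascending doc IDs that match a single term.
 */
final class SortedIndexHistogram {

    /** Index of the first element >= target in an ascending array (length if none). */
    static int lowerBound(long[] sorted, long target) {
        int lo = 0, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Same as above for int arrays (used on the postings list). */
    static int lowerBound(int[] sorted, int target) {
        int lo = 0, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Number of postings whose document has a timestamp strictly below the boundary. */
    static int postingsBefore(long[] timestamps, int[] postings, long boundary) {
        // Because the index is sorted by timestamp, the first doc with
        // timestamp >= boundary splits the doc ID space in two.
        int firstDoc = lowerBound(timestamps, boundary);
        return lowerBound(postings, firstDoc);
    }

    /**
     * Counts matching docs per bucket [start + i*interval, start + (i+1)*interval).
     * Each bucket costs two binary searches instead of a scan over its docs.
     */
    static int[] histogram(long[] timestamps, int[] postings,
                           long start, long interval, int buckets) {
        int[] counts = new int[buckets];
        int prev = postingsBefore(timestamps, postings, start);
        for (int i = 0; i < buckets; i++) {
            int next = postingsBefore(timestamps, postings, start + (i + 1) * interval);
            counts[i] = next - prev; // difference of positions in the postings list
            prev = next;
        }
        return counts;
    }
}
```

For example, with timestamps {10, 11, 15, 20, 21, 29, 30, 42}, postings {0, 2, 3, 5, 7}, start 10, interval 10 and 4 buckets, histogram returns {2, 2, 0, 1}. In a real index the second binary search would presumably be replaced by whatever position information the proposed PostingsEnum API exposes.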

This feels quite powerful. The main thing that annoys me a bit is that it only works on the primary sort field, so we'd be adding an API to PostingsEnum for something that requires a very careful setup of the index, as there can be only a single primary sort field. I wonder if LUCENE-10396 could help make this optimization more often applicable, e.g. to logs indices sorted by host then timestamp, or to e-commerce indices sorted by category then price. Having this optimization more generally applicable would make me feel better about increasing the surface area of PostingsEnum. At first sight, it feels like this should work? Maybe this use case would also help figure out what the API should be on LUCENE-10396.

> count aggregation optimization inside one segment in log scenario
> -----------------------------------------------------------------
>
>                 Key: LUCENE-10425
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10425
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/search
>            Reporter: jianping weng
>            Priority: Major
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In a log scenario, we usually want to know the count of documents in each time interval. One possible optimization is to sort the documents within a segment in ascending order of the @timestamp field. Then we can use this PR [https://github.com/apache/lucene/pull/687] to find the min/max docId in one time interval.
> If there is no other filter query, the doc count of one time interval is (max docId - min docId + 1).
> If there is exactly one additional term filter query, we can use this PR [https://github.com/apache/lucene/pull/688] to get the index within the postings list when we call advance(minId) and advance(maxId); the difference between the two index values is the doc count of that time interval.
>  
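
The two counting rules in the description can be sketched as follows. This is plain Java under stated assumptions, not code from either PR: the class IntervalDocCount and its methods are hypothetical, a sorted int[] stands in for the postings of the term filter, and minDoc/maxDoc are assumed to be the inclusive doc ID bounds of the time interval. The sketch searches for maxDoc + 1 rather than maxDoc so that the upper bound is counted, which is a choice made here and not something the description specifies.

```java
/** Hypothetical illustration only, not a Lucene API. */
final class IntervalDocCount {

    /** No filter: every doc ID in [minDoc, maxDoc] falls in the interval. */
    static int countAll(int minDoc, int maxDoc) {
        return maxDoc < minDoc ? 0 : maxDoc - minDoc + 1;
    }

    /** Position of the first posting >= target, mimicking where advance(target) would land. */
    static int indexAfterAdvance(int[] postings, int target) {
        int lo = 0, hi = postings.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (postings[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /**
     * One term filter: the count is the difference between the postings
     * positions reached for the lower bound and for just past the upper bound.
     */
    static int countWithTerm(int[] postings, int minDoc, int maxDoc) {
        if (maxDoc < minDoc) return 0;
        int from = indexAfterAdvance(postings, minDoc);
        int to = indexAfterAdvance(postings, maxDoc + 1); // assumes maxDoc < Integer.MAX_VALUE
        return to - from;
    }
}
```

With postings {1, 4, 5, 9, 12} and the interval covering doc IDs 4 to 9, countAll(4, 9) is 6 and countWithTerm(postings, 4, 9) is 3 (docs 4, 5 and 9).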


