You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2021/10/26 21:37:59 UTC

[GitHub] [pinot] lfernandez93 opened a new issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

lfernandez93 opened a new issue #7642:
URL: https://github.com/apache/pinot/issues/7642


   When running a query like the following
   
   `SELECT COUNT(*) FROM table WHERE id = x AND time BETWEEN a AND b`
   
   `numEntriesScannedInFilter` is showing more records than the ones scanned `numDocsScanned` , why is this the case? shouldn't it only scan the number that we have selected already on?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
richardstartin commented on issue #7642:
URL: https://github.com/apache/pinot/issues/7642#issuecomment-953859463


   @lfernandez93 this is great, but how much faster?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] lfernandez93 commented on issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
lfernandez93 commented on issue #7642:
URL: https://github.com/apache/pinot/issues/7642#issuecomment-953220763


   to reiterate I should upgrade to latest and then
   
   ```
         "rangeIndexColumns": [
           "serve_time"
         ],
         "rangeIndexVersion": 2,
   ```      


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] lfernandez93 commented on issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
lfernandez93 commented on issue #7642:
URL: https://github.com/apache/pinot/issues/7642#issuecomment-953949427


   in that example query we went from 50 ms to 10 ms


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] lfernandez93 closed issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
lfernandez93 closed issue #7642:
URL: https://github.com/apache/pinot/issues/7642


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] kishoreg commented on issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
kishoreg commented on issue #7642:
URL: https://github.com/apache/pinot/issues/7642#issuecomment-952363577


   What @richardstartin said is probably explaining what you are seeing in your query. 
   
   However, it's important to note that there is no strict relationship (greater than or less than) between numEntriesScannedinFilter and numDocsScanned.
   
   This video might help in understanding all the variables in the query response - https://www.youtube.com/watch?v=VdwVDiXOOVo
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] lfernandez93 commented on issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
lfernandez93 commented on issue #7642:
URL: https://github.com/apache/pinot/issues/7642#issuecomment-953846071


   after upgrading pinot and making the above change i can confirm that the above query now is not only faster but `numEntriesScannedInFilter` went from > than a mill to 0. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] kishoreg commented on issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
kishoreg commented on issue #7642:
URL: https://github.com/apache/pinot/issues/7642#issuecomment-953001970


   If you use the new range index added by @richardstartin, it will come down to 0. Apologies for repeating this again, don't try to correlate numScannedInFilter and numDocsScanned. Think of query processing as having two parts
   
   1.  Filter (applies all the predicates to find the matching rows)
   2. Post Filter (aggregation/group by or just select + order). In this phase, all the matching rows are scanned
   
   numEntriesScannedInFilter is the proxy for work done in the filter phase. The value for this can be anywhere between
   - totalDocs * numColsIn Filter Clause - this is the worst case and happens when there is no pruning, no indexes
   - 0: this is the best case when the right indexes with the right configuration are used. 
   
   numDocsScanned is the proxy for work done post filter phase. The value for this is number of rows that match the filter condition. However, this can be lowered by using star-tree index.
   
   Hope that helps.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
richardstartin commented on issue #7642:
URL: https://github.com/apache/pinot/issues/7642#issuecomment-952354200


   This will be because the query uses a range index on `time`. The range index splits the rows in each segment into (by default) 20 buckets each with a bitmap of row ids which have `time` within the bucket. The buckets which `a` and `b` are in need to be scanned, which is where `numEntriesScannedInFilter` comes from, but the buckets between these buckets don't. 
   
   A new kind of range index which doesn't need to perform scans, which leads to accelerated range queries, was added in #7454, if you are building from source you can try it by adding `"rangeIndexVersion": 2` to your `tableConfig` - otherwise it will be in the 0.9.0 release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
richardstartin commented on issue #7642:
URL: https://github.com/apache/pinot/issues/7642#issuecomment-953110362


   You do do have an implicit range index on `serve_time`. Most of the time is spent in the range filter operator ("RangeFilterOperator" - class `RangeIndexBasedFilterOperator`), and this is where the high `numEntriesScannedInFilter` comes from. 
   
   This would be fixed by setting "rangeIndexVersion": 2 in 0.9.0 (or now if you're building pinot yourself), but we can do better still: the bitmap from the `user_id` filter _could_ be passed down to the `RangeIndexBasedFilterOperator` to accelerate the range query further, as per #7597.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] lfernandez93 edited a comment on issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
lfernandez93 edited a comment on issue #7642:
URL: https://github.com/apache/pinot/issues/7642#issuecomment-953846071


   after upgrading pinot and making the above change i can confirm that the above query now is not only faster but `numEntriesScannedInFilter` went from > than a mill to 0. 
   
   thank you so much @richardstartin and @kishoreg 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] lfernandez93 commented on issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
lfernandez93 commented on issue #7642:
URL: https://github.com/apache/pinot/issues/7642#issuecomment-952894217


   in the queried mentioned above there's no range index in the `time` column would that still be expected? or does that happen by default.
   
   additional info:
   
   there's a bloomFilter on the `id` and it's the `sortedColumn` as well. Also the `segmentPartitionMap` is happening on the `id` we used to have a `rangeIndex` on the `time` column but we removed it, what would that do in that case?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
richardstartin commented on issue #7642:
URL: https://github.com/apache/pinot/issues/7642#issuecomment-952358288


   This is also related to #7597 (propagate filter state to subsequent filtering operators) and #7600 (if an operator is selective enough, skip subsequent indexes and scan instead to complete the query)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
richardstartin commented on issue #7642:
URL: https://github.com/apache/pinot/issues/7642#issuecomment-952989179


   @lfernandez93 could you provide a trace from the query console (select "Show JSON Format") and paste it back here please? Also can you share the parts of your table config which relate to `id` and `time`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] lfernandez93 edited a comment on issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
lfernandez93 edited a comment on issue #7642:
URL: https://github.com/apache/pinot/issues/7642#issuecomment-952894217


   in the query mentioned above there's no range index in the `time` column would that still be expected? or does that happen by default.
   
   additional info:
   
   there's a bloomFilter on the `id` and it's the `sortedColumn` as well. Also the `segmentPartitionMap` is happening on the `id` we used to have a `rangeIndex` on the `time` column but we removed it, what would that do in that case?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] lfernandez93 edited a comment on issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
lfernandez93 edited a comment on issue #7642:
URL: https://github.com/apache/pinot/issues/7642#issuecomment-953099072


   thanks @kishoreg this explains a lot then. basically if `numEntriesScannedInFilter` is high means that selectivity is high and maybe we need some more pruning that number definitely wouldn't correlate to `numDocsScanned` which happens after we have done postFilter which will be the final result(?).
   
   I just got weird out cause as soon as I remove the `time` then the query is much faster and I was wondering why if I'm trying to select even less data with `time` I do see `numEntriesScannedInFilter` go to 0 after i remove the `time` clause.
   
   @richardstartin trace info
   
   `
       "traceInfo": {
           "pinot-server-1.pinot-server-headless.pinot.svc.cluster.local": "[{\"0\":[{\"GroupByOrderByCombineOperator Time\":53},{\"InstanceResponseOperator Time\":53}]},{\"0_0\":[{\"SortedIndexBasedFilterOperator Time\":0},{\"DocIdSetOperator Time\":0},{\"ProjectionOperator Time\":0},{\"PassThroughTransformOperator Time\":0},{\"DocIdSetOperator Time\":0},{\"ProjectionOperator Time\":0},{\"PassThroughTransformOperator Time\":0},{\"AggregationGroupByOrderByOperator Time\":0}]},{\"0_1\":[{\"SortedIndexBasedFilterOperator Time\":0},{\"ScanBasedFilterOperator Time\":0},{\"AndFilterOperator Time\":0},{\"DocIdSetOperator Time\":0},{\"ProjectionOperator Time\":0},{\"PassThroughTransformOperator Time\":0},{\"DocIdSetOperator Time\":0},{\"ProjectionOperator Time\":0},{\"PassThroughTransformOperator Time\":0},{\"AggregationGroupByOrderByOperator Time\":0}]},{\"0_2\":[{\"SortedIndexBasedFilterOperator Time\":0},{\"RangeFilterOperator Time\":53},{\"AndFilterOperator Time\":53},{\"DocIdSetOperato
 r Time\":53},{\"ProjectionOperator Time\":53},{\"PassThroughTransformOperator Time\":53},{\"AggregationGroupByOrderByOperator Time\":53}]}]"
       },
   `
   
   debug info
   
   ```json
       "numServersQueried": 1,
       "numServersResponded": 1,
       "numSegmentsQueried": 29,
       "numSegmentsProcessed": 3,
       "numSegmentsMatched": 2,
       "numConsumingSegmentsQueried": 1,
       "numDocsScanned": 84,
       "numEntriesScannedInFilter": 1169794,
       "numEntriesScannedPostFilter": 84,
       "numGroupsLimitReached": false,
       "totalDocs": 150832761,
       "timeUsedMs": 57,
       "offlineThreadCpuTimeNs": 0,
       "realtimeThreadCpuTimeNs": 53524933,
       "numRowsResultSet": 10,
       "minConsumingFreshnessTimeMs": 1635351812064
   ```
   
   table configs (`time` is `serve_time` `id` is `user_id`):
   
   ```json
   "segmentsConfig": {
       "schemaName": "metrics",
       "retentionTimeUnit": "DAYS",
       "retentionTimeValue": "365",
       "timeColumnName": "serve_time",
       "replication": "1",
       "replicasPerPartition": "1"
     },
   "tableIndexConfig": {
       "bloomFilterColumns": [
           "user_id"
         ],
        "sortedColumn": [
           "user_id"
         ],
       "segmentPartitionConfig": {
         "columnPartitionMap": {
           "user_id": {
             "functionName": "Murmur",
             "numPartitions": 8
           }
         }
       },
   "routing": {
     "segmentPrunerTypes": [
       "partition"
     ]
    }
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] lfernandez93 commented on issue #7642: numEntriesScannedInFilter > numDocsScanned in certain scenarios

Posted by GitBox <gi...@apache.org>.
lfernandez93 commented on issue #7642:
URL: https://github.com/apache/pinot/issues/7642#issuecomment-953099072


   thanks @kishoreg this explains a lot then. basically if `numEntriesScannedInFilter` is high means that selectivity is high and maybe we need some more pruning that number definitely wouldn't correlate to `numDocsScanned` which happens after we have done postFilter which will be the final result(?).
   
   I just got weird out cause as soon as I remove the `time` then the query is much faster and I was wondering why if I'm trying to select even less data with `time` I do see `numEntriesScannedInFilter` go to 0 after i remove the `time` clause.
   
   @richardstartin trace info
   
   `
       "traceInfo": {
           "pinot-server-1.pinot-server-headless.pinot.svc.cluster.local": "[{\"0\":[{\"GroupByOrderByCombineOperator Time\":53},{\"InstanceResponseOperator Time\":53}]},{\"0_0\":[{\"SortedIndexBasedFilterOperator Time\":0},{\"DocIdSetOperator Time\":0},{\"ProjectionOperator Time\":0},{\"PassThroughTransformOperator Time\":0},{\"DocIdSetOperator Time\":0},{\"ProjectionOperator Time\":0},{\"PassThroughTransformOperator Time\":0},{\"AggregationGroupByOrderByOperator Time\":0}]},{\"0_1\":[{\"SortedIndexBasedFilterOperator Time\":0},{\"ScanBasedFilterOperator Time\":0},{\"AndFilterOperator Time\":0},{\"DocIdSetOperator Time\":0},{\"ProjectionOperator Time\":0},{\"PassThroughTransformOperator Time\":0},{\"DocIdSetOperator Time\":0},{\"ProjectionOperator Time\":0},{\"PassThroughTransformOperator Time\":0},{\"AggregationGroupByOrderByOperator Time\":0}]},{\"0_2\":[{\"SortedIndexBasedFilterOperator Time\":0},{\"RangeFilterOperator Time\":53},{\"AndFilterOperator Time\":53},{\"DocIdSetOperato
 r Time\":53},{\"ProjectionOperator Time\":53},{\"PassThroughTransformOperator Time\":53},{\"AggregationGroupByOrderByOperator Time\":53}]}]"
       },
   `
   
   debug info
   
   ```json
       "numServersQueried": 1,
       "numServersResponded": 1,
       "numSegmentsQueried": 29,
       "numSegmentsProcessed": 3,
       "numSegmentsMatched": 2,
       "numConsumingSegmentsQueried": 1,
       "numDocsScanned": 84,
       "numEntriesScannedInFilter": 1169794,
       "numEntriesScannedPostFilter": 84,
       "numGroupsLimitReached": false,
       "totalDocs": 150832761,
       "timeUsedMs": 57,
       "offlineThreadCpuTimeNs": 0,
       "realtimeThreadCpuTimeNs": 53524933,
       "numRowsResultSet": 10,
       "minConsumingFreshnessTimeMs": 1635351812064
   ```
   
   table configs (`time` is `serve_time` `id` is `user_id`):
   
   ```json
   "segmentsConfig": {
       "schemaName": "ads_metrics",
       "retentionTimeUnit": "DAYS",
       "retentionTimeValue": "365",
       "timeColumnName": "serve_time",
       "replication": "1",
       "replicasPerPartition": "1"
     },
   "tableIndexConfig": {
       "bloomFilterColumns": [
           "user_id"
         ],
        "sortedColumn": [
           "user_id"
         ],
       "segmentPartitionConfig": {
         "columnPartitionMap": {
           "user_id": {
             "functionName": "Murmur",
             "numPartitions": 8
           }
         }
       },
   "routing": {
     "segmentPrunerTypes": [
       "partition"
     ]
    }
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org