You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/04/06 02:00:18 UTC

Apache Pinot Daily Email Digest (2021-04-05)

### _#general_

  
 **@maheshda:** @maheshda has joined the channel  
 **@julius.syvanen:** @julius.syvanen has joined the channel  
 **@y.a:** @y.a has joined the channel  
 **@nkarandikar:** @nkarandikar has joined the channel  
 **@lakshmanan.velusamy:** @lakshmanan.velusamy has joined the channel  

###  _#random_

  
 **@maheshda:** @maheshda has joined the channel  
 **@julius.syvanen:** @julius.syvanen has joined the channel  
 **@y.a:** @y.a has joined the channel  
 **@nkarandikar:** @nkarandikar has joined the channel  
 **@lakshmanan.velusamy:** @lakshmanan.velusamy has joined the channel  

###  _#troubleshooting_

  
 **@maheshda:** @maheshda has joined the channel  
 **@julius.syvanen:** @julius.syvanen has joined the channel  
 **@y.a:** @y.a has joined the channel  
 **@nkarandikar:** @nkarandikar has joined the channel  
 **@lakshmanan.velusamy:** @lakshmanan.velusamy has joined the channel  
 **@lakshmanan.velusamy:** Hi Community, I am fairly new to pinot. Does the
star tree index help when we have aggregate queries with time range (and
exclusion) filters ?  
**@lakshmanan.velusamy:** Here is the query: ```SELECT
created_at_1_hour_seconds as time_col, SUM(id) as total FROM table WHERE
(created_at_seconds BETWEEN 1615924566 AND 1617134166) AND ((field1 <> 'null')
AND (entity_id NOT IN ('uuid1', 'uuid2', 'uuid3')) GROUP BY time_col ORDER BY
time_col ASC ``` table/index config: ```indexSpec: starTreeIndexConfigs: \-
dimensionsSplitOrder: \- created_at_1_hour_seconds \- created_at_seconds \-
field1 \- entity_id functionColumnPairs: \- function: SUM column: id
skipStarNodeCreationForDimensions: \- created_at_seconds bloomFilterColumns:
\- field1 \- entity_id```  
**@jackie.jxt:** It helps if the cardinality of the columns are relatively low  
**@lakshmanan.velusamy:** Not sure if created_at_seconds should be on the star
tree index dimension split order as the cardinality is very high (timestamps
are millisecs granularity).  
**@jackie.jxt:** In order to use star-tree to solve queries with filter on it,
it needs to be included in the split order  
**@jackie.jxt:** `created_at_seconds` should be second granularity right?  
**@lakshmanan.velusamy:** my bad, yeah its seconds.  
**@jackie.jxt:** Cardinality wise it should be fine. How about `entity_id`?  
**@jackie.jxt:** I feel it's cardinality is going to be high  
**@lakshmanan.velusamy:** should created_at_seconds be on the dimension order
given the high Cardinality ?  
**@lakshmanan.velusamy:** entity_id has high cardinality, but total_records
(in millions) >> total_entities (in 10s of thousands).  
**@jackie.jxt:** Ok, so both `entity_id` and `created_at` would have
cardinality of 10s thousands per segment  
**@jackie.jxt:** I would recommend not including `created_at_seconds` in the
split order in this case. Segments that are fully covered in the time range
will use the star-tree index. Segments that are partially covered will fall
back to the non-aggregated records.  
**@jackie.jxt:** Then splitOrder will be `created_at_1_hour_seconds, field1,
entity_id`  
**@lakshmanan.velusamy:** Got it, we have this in the star tree index config
as well: ``` skipStarNodeCreationForDimensions: \- created_at_seconds```  
**@lakshmanan.velusamy:** thanks for the reply @jackie.jxt!  
**@jackie.jxt:** Yeah, you may remove it from the skip list as well  
**@lakshmanan.velusamy:** We got an another query with more complex
aggregations: ```SELECT dimension_uuid as dimension, AVG(total) AS avg_total,
SUM(total)/DistinctCount(entity_id) AS total_per_entity, COUNT(order_id) AS
order_count, SUM(total) AS total_amount,
COUNT(order_id)/DistinctCount(entity_id) AS orders_per_entity,
DISTINCTCOUNT(entity_id) AS entity_count FROM table WHERE (created_at_seconds
BETWEEN 1617049575 AND 1617654375) AND (field1 <> 'null') AND (entity_id NOT
IN ('uuid1', 'uuid2', 'uuid3')) GROUP BY dimension ORDER BY order_count DESC
LIMIT 50```  
**@lakshmanan.velusamy:** this one has the same set of filters, but does a
bunch of aggregations, including DISTINCTCOUNT. Is there a chance to improve
performance using star tree index at all for the aggregations ?  
**@jackie.jxt:** Here you can find all the supported functions:  
**@jackie.jxt:** If you need accurate distinct (`distinctcount` instead of
`distinctcounthll`), then it cannot be supported by star-tree due to the risk
of storage explosion  
**@jackie.jxt:** For the second query, you need to put both `dimension_uuid`
and `entity_id` into the split list. The performance of star-tree comes from
the pre-aggregation of the records, and I'm not sure if we can get much pre-
aggregation with these 2 high cardinality dimensions  
**@lakshmanan.velusamy:** Was skeptical about DISTINCTCOUNT as it mentioned
that is not supported due to storage explosion problem, will look into the
possibility of using distinctcounthll and also measure the performance with
and without index to see if there is a difference.  
**@lakshmanan.velusamy:** Any tool/command to understand the effect of the
index when processing the query ?  
**@jackie.jxt:** There is no in-built tool for that. For experiment, you can
set up 2 tables with the same data, one with index and one without, and query
them separately to compare the throughput and latency  
**@lakshmanan.velusamy:** Got it, thats the pretty much the setup we got,
along with checking the query stats (especially `numEntriesScannedPostFilter`
in the response stats to see the impact of star tree filter on aggregations)
for the same query with and without index, along with latency.  
**@lakshmanan.velusamy:** Is there an accuracy (standard error) expectation on
the distinctcounthll algorithm used in pinot ?  
**@jackie.jxt:** We expect ~2% standard error for HyperLogLog. Reference:  
**@lakshmanan.velusamy:** is it 2% for low value metrics as well ? Any plans
to implement something like  ?  
**@jackie.jxt:** Based on my understanding, it will be quite accurate with few
values  
**@jackie.jxt:** The HLL Sketch from `DataSketches` seems promising:  
**@jackie.jxt:** We don't have it supported yet. It should not be hard to add.
Contributions are very welcome.  

###  _#segment-write-api_

  
 **@npawar:** SegmentUploader impl:  
\--------------------------------------------------------------------- To
unsubscribe, e-mail: dev-unsubscribe@pinot.apache.org For additional commands,
e-mail: dev-help@pinot.apache.org