Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/01/19 08:05:28 UTC

[GitHub] [pinot] siddharthteotia opened a new issue #8039: Improve accuracy of group by without order by queries

siddharthteotia opened a new issue #8039:
URL: https://github.com/apache/pinot/issues/8039


   Currently, GROUP BY queries without ORDER BY can produce very inaccurate results: each server keeps at most N groups (N coming from LIMIT N), selected arbitrarily, and, unlike the ORDER BY path, there is no resize/trimming.
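A minimal sketch of how this can go wrong, assuming a SUM aggregation and assuming each server simply keeps the first N groups it encounters and drops the rest. The function names `server_first_n_groups` and `broker_combine` are hypothetical and are not Pinot APIs:

```python
def server_first_n_groups(rows, limit):
    """SUM(value) GROUP BY key, keeping only the first `limit` groups seen.

    Assumption for illustration: once `limit` groups exist, rows that belong
    to any new group are silently dropped.
    """
    groups = {}
    for key, value in rows:
        if key in groups:
            groups[key] += value
        elif len(groups) < limit:
            groups[key] = value
        # else: the group is not tracked on this server, so its rows are lost
    return groups


def broker_combine(server_results, limit):
    """Merge per-server partial sums and return the first `limit` groups."""
    merged = {}
    for result in server_results:
        for key, value in result.items():
            merged[key] = merged.get(key, 0) + value
    return dict(list(merged.items())[:limit])


server_1 = server_first_n_groups([("a", 1), ("b", 1), ("c", 1)], limit=2)
server_2 = server_first_n_groups([("c", 1), ("b", 5), ("a", 1)], limit=2)
result = broker_combine([server_1, server_2], limit=2)
# Group "a" has a true sum of 2, but server_2 dropped its row for "a",
# so the broker reports a partial sum for it.
```

Which groups survive on each server depends only on the order in which rows arrive, so the same query can return different partial sums for the same group.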
   
   An easier way to handle this would be to add an implicit ORDER BY on the GROUP BY and/or aggregation columns if there is no ORDER BY in the query. This will allow us to reuse the current ORDER BY code path, which is more accurate, and will provide the same level of accuracy and determinism as the current GROUP BY with ORDER BY.
   
   If we want to improve accuracy without ordering results, then some changes in TableResizer might be needed. We will continue to accumulate more records (up to `trimThreshold`, as in the ORDER BY code), but the resizer won't sort when trimming to `trimSize`. It can simply evict `trimThreshold - trimSize` records without worrying about order. While this will improve accuracy, the result won't be deterministic.
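The unordered trim described above could be sketched roughly as follows. This is not the TableResizer code: `aggregate_with_unordered_trim` is a hypothetical name, the aggregation is assumed to be SUM, and the "arbitrary" eviction is modelled by dropping whichever groups the dictionary yields first.

```python
from itertools import islice


def aggregate_with_unordered_trim(rows, trim_size, trim_threshold):
    """SUM(value) GROUP BY key with a sort-free trim (illustrative only)."""
    if not 0 < trim_size < trim_threshold:
        raise ValueError("expected 0 < trim_size < trim_threshold")
    groups = {}
    for key, value in rows:
        groups[key] = groups.get(key, 0) + value
        if len(groups) >= trim_threshold:
            # Evict (trim_threshold - trim_size) groups without sorting.
            # Which groups go is arbitrary; here it is simply the first
            # ones in dict iteration order.
            victims = list(islice(groups, len(groups) - trim_size))
            for victim in victims:
                del groups[victim]
    return groups


rows = [("a", 1), ("b", 1), ("c", 1), ("d", 1), ("a", 1)]
trimmed = aggregate_with_unordered_trim(rows, trim_size=2, trim_threshold=4)
untrimmed = aggregate_with_unordered_trim(rows, trim_size=2, trim_threshold=100)
```

In this sketch an evicted group that shows up again restarts from zero (group "a" above), which is why accumulating more groups before trimming improves accuracy without guaranteeing it, and why the outcome depends on which groups happen to be evicted.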


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on issue #8039: Improve accuracy of group by without order by queries

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on issue #8039:
URL: https://github.com/apache/pinot/issues/8039#issuecomment-1016733015


   This optimization can also be applied to cases where ORDER BY is applied only to GROUP BY columns. There is no need to keep extra groups in such cases.




[GitHub] [pinot] Jackie-Jiang commented on issue #8039: Improve accuracy of group by without order by queries

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on issue #8039:
URL: https://github.com/apache/pinot/issues/8039#issuecomment-1016727873


   I've also thought about this problem, and I'm leaning towards the first approach because, without ordering, the second approach might also not give accurate results.
   The cheapest solution that gives deterministic results should be to always order by all the GROUP BY columns and keep only `limit` groups per server. On the broker we can do a merge sort and return once `limit` groups are reached. We cannot order by aggregate results because the values can change during aggregation.
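A rough sketch of that idea, assuming a SUM aggregation and a single group-by key. `server_top_groups` and `broker_merge` are hypothetical names, not Pinot classes, and a real server would bound the number of groups during aggregation rather than sort everything at the end:

```python
import heapq
from itertools import groupby, islice
from operator import itemgetter


def server_top_groups(rows, limit):
    """SUM(value) GROUP BY key; return the `limit` smallest keys, sorted."""
    sums = {}
    for key, value in rows:
        sums[key] = sums.get(key, 0) + value
    return sorted(sums.items(), key=itemgetter(0))[:limit]


def broker_merge(server_results, limit):
    """Merge-sort the sorted per-server lists and stop after `limit` groups."""
    merged = heapq.merge(*server_results, key=itemgetter(0))
    combined = (
        (key, sum(value for _, value in group))
        for key, group in groupby(merged, key=itemgetter(0))
    )
    # islice stops consuming the merged stream once `limit` groups are built.
    return list(islice(combined, limit))


s1 = server_top_groups([("b", 1), ("a", 2), ("c", 3)], limit=2)
s2 = server_top_groups([("c", 5), ("a", 1), ("d", 1)], limit=2)
final = broker_merge([s1, s2], limit=2)
```

Any key among the globally smallest `limit` keys is also among the smallest `limit` keys on every server that has it, so the groups the broker returns carry complete sums. Groups beyond the limit (such as "c" here) may be partial, but they are never returned.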

