You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2019/06/27 00:26:50 UTC

[GitHub] [incubator-pinot] siddharthteotia opened a new issue #4372: Support for single phase and two-phase distributed hash aggregation

siddharthteotia opened a new issue #4372: Support for single phase and two-phase distributed hash aggregation
URL: https://github.com/apache/incubator-pinot/issues/4372
 
 
   I am not entirely sure to what extent this is already supported -- so apologies if this is something that is already done or something not applicable
   
   Idea:
   
   GROUP BY processing using hash based aggregation in a distributed query engine can be done  using two different methods and each one has an impact on performance depending on cardinality
   
   (1) **Two-phase hash aggregation:** In the first phase, each node will do a local group by from the data (segment) on that node -- essentially computing partial resultset from a single thread query execution for a given segment at a given node.
   
   The physical plan or execution plan will have a shuffle above the aggregation group by operator -- the shuffle operator will use the partial aggregates and send them around such that each node now ends up with all the data (partial aggregates) corresponding to group by key(s). 
   
   The second phase of hash agg will now work on the partial aggregates and compute the final resultset for a given set of key(s) per node. Each node can now send its result set to broker. Broker only needs to union as opposed to doing any aggregation related processing.
   
   (2) **Single-phase hash aggregation**: The above described two phase method proves to be useful if the cardinality of group by key columns is low -- since doing local hash agg first achieves a higher degree of reduction and thus reduces the data sent around. However, if cardinality is very high then it is better to do single phase aggregation. In this case, below the aggregation operator, there will be a shuffle operator. Again, after hash agg, each node can send the final result set to broker and it just needs to union.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org