Posted to dev@hive.apache.org by "Remus Rusanu (JIRA)" <ji...@apache.org> on 2014/03/08 21:55:42 UTC

[jira] [Commented] (HIVE-6222) Make Vector Group By operator abandon grouping if too many distinct keys

    [ https://issues.apache.org/jira/browse/HIVE-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925001#comment-13925001 ] 

Remus Rusanu commented on HIVE-6222:
------------------------------------

The 1.patch refactors the VectorGroupByOperator to delegate the algorithm used to a nested processingMode object. Three processing modes are provided:

 - global aggregate. This is the trivial mode when there are no keys. All values are aggregated into a single row of aggregation buffers, and the values are emitted at operator closeOp().
 - hash aggregate. This is all the previous VGBy operator logic, with the hash table and including memory pressure flushes.
 - streaming aggregate. This mode aggregates intermediate values as keys change in the input and flushes at each key value change. It relies on the MR shuffle and the row-mode GBy reduce phase to merge the intermediate values. Due to the way aggregators operate on batches, the flushing logic is not strictly 'on new key' but 'for all new keys in a batch, except the last'. Identical keys in a batch are not aggregated unless they form a contiguous run.
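
To make the streaming behaviour concrete, here is a minimal sketch of the idea in plain Java. It is not the code from the patch: the class StreamingSumSketch, its methods, the String keys and the SUM aggregate are all hypothetical, and it walks rows one at a time rather than using vectorized aggregators. It only illustrates that a buffer is flushed when the key changes, that the last key of a batch stays open into the next batch, and that non-contiguous repeats of a key are emitted separately and left for the reduce side to merge.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from the Hive patch.
final class StreamingSumSketch {
    private final List<String> emitted = new ArrayList<>();
    private String currentKey;  // key of the run still being aggregated, or null
    private long currentSum;    // partial SUM for currentKey

    /** Processes one batch of (key, value) rows given as parallel arrays. */
    void processBatch(String[] keys, long[] values) {
        for (int i = 0; i < keys.length; i++) {
            // A key change ends the current contiguous run: emit its partial result.
            if (currentKey != null && !currentKey.equals(keys[i])) {
                flush();
            }
            if (currentKey == null) {
                currentKey = keys[i];
                currentSum = 0;
            }
            currentSum += values[i];
        }
        // The last run of the batch is NOT flushed here; it may continue
        // in the next batch.
    }

    /** Emits whatever is still buffered, as an operator would on close. */
    void close() {
        if (currentKey != null) {
            flush();
        }
    }

    private void flush() {
        emitted.add(currentKey + "=" + currentSum);
        currentKey = null;
    }

    List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        StreamingSumSketch agg = new StreamingSumSketch();
        agg.processBatch(new String[] {"a", "a", "b", "a"}, new long[] {1, 2, 3, 4});
        agg.processBatch(new String[] {"a", "c"}, new long[] {5, 6});
        agg.close();
        System.out.println(agg.emitted());  // prints [a=3, b=3, a=9, c=6]
    }
}
```

Note that key "a" is emitted twice: the first run (1+2) is flushed when "b" arrives, and the later run spans the batch boundary (4+5). In this sketch, as in the description above, combining those two partial results is left to the downstream reduce phase.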

This patch will conflict with HIVE-6518 because the relevant code is moved into the new nested ProcessingModeHashAggregate class. Porting the fix is trivial. I will rebase either this patch or HIVE-6518, depending on which gets committed first.

> Make Vector Group By operator abandon grouping if too many distinct keys
> ------------------------------------------------------------------------
>
>                 Key: HIVE-6222
>                 URL: https://issues.apache.org/jira/browse/HIVE-6222
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Remus Rusanu
>            Assignee: Remus Rusanu
>            Priority: Minor
>         Attachments: HIVE-6222.1.patch
>
>
> Row mode GBY is becoming a pass-through if not enough aggregation occurs on the map side, relying on the shuffle+reduce side to do the work. Have VGBY do the same.



--
This message was sent by Atlassian JIRA
(v6.2#6252)