Posted to dev@hive.apache.org by "Namit Jain (JIRA)" <ji...@apache.org> on 2009/02/06 02:17:59 UTC

[jira] Commented: (HIVE-223) when using map-side aggregates - perform single map-reduce group-by

    [ https://issues.apache.org/jira/browse/HIVE-223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670973#action_12670973 ] 

Namit Jain commented on HIVE-223:
---------------------------------

Asking the user to specify skew on a per-column basis may be very difficult.

I would instead propose the following.

Have 2 parameters: cardinality and skew.

Cardinality determines whether to use map-side aggregates or not.
Skew determines whether to use 1 or 2 map-reduce jobs.

They are independent of each other.

We can use the one existing parameter (hive.map.aggr) and add another one (hive.groupby.skewindata).
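
As a rough illustration of how the two settings would combine, here is a minimal Python sketch. The names choose_groupby_plan and GroupByPlan are hypothetical and not part of Hive; only the two parameter names come from the proposal above, and treating a missing setting as false is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupByPlan:
    map_side_aggregation: bool  # aggregate on the mapper before the shuffle
    num_jobs: int               # number of map-reduce jobs used for the group-by


def choose_groupby_plan(conf: dict) -> GroupByPlan:
    """Hypothetical helper: derive the plan shape from the two settings."""
    # Assumption: a setting that is absent is treated as "false".
    map_aggr = str(conf.get("hive.map.aggr", "false")).lower() == "true"
    skew = str(conf.get("hive.groupby.skewindata", "false")).lower() == "true"
    # The two choices are independent: one decides where aggregation starts,
    # the other decides how many jobs are run.
    return GroupByPlan(map_side_aggregation=map_aggr, num_jobs=2 if skew else 1)
```

Because the two settings are independent, the four combinations correspond to the four plan shapes.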

It would be better to make them hints, so that they are specific to a query block, but that can be done later since, in general, we have been avoiding hints.

As far as the plans go, we have:

1. map-side aggregations with 1 map-reduce job:

group by grouping + distinct key on mapper
group by/sort by grouping + distinct key


2. sort based aggregations with 1 map-reduce job:  

group by/sort by grouping + distinct key


3. map-side aggregations with 2 map-reduce jobs:

   0. group by/sort by grouping + distinct key (distinct == null if no distinct present)
   a. group by/sort by grouping + distinct key (distinct == random if no distinct present)
   b. group by/sort by grouping key

4. sort based aggregations with 2 map-reduce jobs:

   a. group by/sort by grouping + distinct key (distinct == random if no distinct present)
   b. group by/sort by grouping key
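
To make the two-job idea concrete, here is a small Python simulation of plan 4 for a COUNT(*) query with no distinct. It is only a sketch: shuffle and two_job_count are hypothetical names, reducers are modelled as in-memory buckets, and the random value stands in for the "distinct == random" key used when no distinct is present.

```python
import random
from collections import Counter, defaultdict


def shuffle(records, num_reducers, partition_key):
    """Route each record to a reducer bucket by hashing its partition key."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[hash(partition_key(rec)) % num_reducers].append(rec)
    return buckets


def two_job_count(rows, num_reducers, rng):
    """COUNT(*) per grouping key, computed in two simulated jobs."""
    # Job a: partition on (grouping key, random value), emit partial counts.
    salted = [(key, rng.randrange(num_reducers)) for key in rows]
    job_a = shuffle(salted, num_reducers, lambda rec: rec)
    partials = []
    for reducer_rows in job_a.values():
        partials.extend(Counter(key for key, _ in reducer_rows).items())
    # Job b: partition on the grouping key alone, merge the partial counts.
    job_b = shuffle(partials, num_reducers, lambda rec: rec[0])
    result = {}
    for reducer_rows in job_b.values():
        for key, count in reducer_rows:
            result[key] = result.get(key, 0) + count
    # Also report the largest job-a reducer input, as a rough skew indicator.
    return result, max(len(v) for v in job_a.values())


rows = ["hot"] * 1000 + ["a", "b", "c"]
counts, largest_input = two_job_count(rows, num_reducers=4, rng=random.Random(0))
```

The counts are the same as a single-job group-by would produce. The difference is that the rows for a heavily repeated key are usually spread over several reducers in job a, so job b only has to merge a handful of partial counts per key.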



> when using map-side aggregates - perform single map-reduce group-by
> -------------------------------------------------------------------
>
>                 Key: HIVE-223
>                 URL: https://issues.apache.org/jira/browse/HIVE-223
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>            Assignee: Namit Jain
>
> Today, even when we do map-side aggregates, we run multiple map-reduce jobs. However, the reason for doing multiple map-reduce group-bys (for single group-bys) was the fear of skews. When we are doing map-side aggregates, skews should not exist for the most part. There can be two reasons for skews:
> - a large number of entries for a single grouping set: map-side aggregates should take care of this
> - a bad hash function that sends too much data to one reducer: we should be able to take care of this by having good hash functions (and prime-number reducer counts)
> So I think we should be able to do a single-stage map-reduce when doing map-side aggregates.
