Posted to dev@hive.apache.org by "Namit Jain (JIRA)" <ji...@apache.org> on 2009/11/13 21:22:39 UTC

[jira] Created: (HIVE-931) Sorted Group By

Sorted Group By
---------------

                 Key: HIVE-931
                 URL: https://issues.apache.org/jira/browse/HIVE-931
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Query Processor
            Reporter: Namit Jain
            Assignee: He Yongqiang
             Fix For: 0.5.0


If the table is sorted by a given key, we don't use that for group by. That can be very useful.

For example, if T is sorted by column c1,

For select c1, aggr() from T group by c1
we always use a single map-reduce job. No hash table is needed on the mapper, since the data is sorted by c1 anyway.

This will reduce the memory pressure on the mapper and also remove overhead of maintaining the hash table.
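
As a rough illustration of the idea (a Python sketch, not Hive's operator code; the function names and the two-column row shape are assumptions made for this example): with unsorted input the mapper keeps one hash-table entry per distinct key, whereas with input already sorted by c1 it only needs the current key and a running aggregate.

```python
from typing import Dict, Iterable, Iterator, Tuple

Row = Tuple[str, int]  # (c1, value): a hypothetical two-column row


def hash_group_sum(rows: Iterable[Row]) -> Dict[str, int]:
    """Hash-based aggregation: one in-memory entry per distinct key."""
    totals: Dict[str, int] = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


def sorted_group_sum(rows: Iterable[Row]) -> Iterator[Row]:
    """Streaming aggregation: assumes rows arrive sorted by key.

    Only the current key and its running sum are held in memory.
    """
    current_key = None
    running = 0
    seen_any = False
    for key, value in rows:
        if seen_any and key != current_key:
            # The key changed, so the previous group is complete.
            yield current_key, running
            running = 0
        current_key = key
        running += value
        seen_any = True
    if seen_any:
        yield current_key, running


rows = [("a", 1), ("a", 2), ("b", 5), ("c", 3), ("c", 4)]
streamed = list(sorted_group_sum(rows))
```

Both functions produce the same totals on sorted input; the streaming version is only correct when equal keys are adjacent.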


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-931) Sorted Group By

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780886#action_12780886 ] 

Namit Jain commented on HIVE-931:
---------------------------------

1. Add new parameter in hive-default.xml
2. Utilities.java: change function names - extractColumnNamesFromSortCols
   variable name: bucketCol: line 813
3. Remove all the tabs
4. Given the fact that we are not doing this optimization across sub-queries right now,
   would it be simpler to maintain the group by operator to table mapping via a separate walker
   instead of getting it while generating the group by operator? -- I am fine with the current approach
   also, but just a question.
5. You are still doing partition pruning in GroupByOptimizer - why can't we reuse the mapping from
   ParseContext? That was the whole reason for storing it in ParseContext.


Sorry about being so picky...



[jira] Updated: (HIVE-931) Sorted Group By

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-931:
------------------------------

    Attachment: hive-931-2009-12-01.patch

Updated the patch, which now takes sort columns into consideration.

     * We use bucket columns only when the sorted column set is empty or the
     * sorted column set is an exact prefix match of bucket columns. For example, a
     * table is bucketed by columns a, b, and c, and a query wants to group by
     * a,b,c. If the table's sort column is null, or is [a], [a,b], or [a,b,c],
     * we can use the 'sorted groupby' by looking at the bucket columns.
     * 
     * If we cannot determine this by looking at the bucketed columns and the table
     * has sort columns, we resort to the sort columns. We can use bucket group by
     * if the groupby column set is an exact prefix match of the sort columns.



[jira] Updated: (HIVE-931) Sorted Group By

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-931:
------------------------------

    Attachment: hive-931-2009-11-20.3.patch

Thanks for the detailed comments, Namit!
bq. I think: bucketCols should be same as groupbyCols. not a superset.
Done. Changed all occurrences of sortedGroupby to bucketGroupby to avoid confusion. (At first we thought we needed to do a sorted group by, but what we actually do is more accurately a bucket group by.)
bq. Also, change the name of the variable in isTableSortedbyColumns to bucketedCols instead of sortedCols
Done.

bq. 1. Given the fact that partition pruning has already happened and stored in the parse context, can you use that information
instead of calling PartitionPruner.prune() again?
Done. Actually, partition pruning does not perform the actual pruning at the optimize phase. hive-931-2009-11-20.3.patch adds a field in ParseContext to reuse the results of PartitionPruner.
bq. Instead of walking up the tree, can you collect the list of the tablescans before that group by ?
Done.

Also added a test case for the subquery case.





[jira] Resolved: (HIVE-931) Sorted Group By

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain resolved HIVE-931.
-----------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

Committed. Thanks Yongqiang



[jira] Updated: (HIVE-931) Optimize GROUP BY aggregations where key is a sorted/bucketed column

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-931:
--------------------------------

    Summary: Optimize GROUP BY aggregations where key is a sorted/bucketed column  (was: Sorted Group By)



[jira] Updated: (HIVE-931) Sorted Group By

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-931:
------------------------------

    Attachment: hive-931-2009-11-21.patch

Updated the patch. Thanks Namit!  
{quote}
5. You are still doing partition pruning in GroupByOptimizer - why cant we reuse the mapping from
ParseContext. That was the whole reason for storing it in ParseContext. 
{quote}
The mapping in ParseContext is null when we reach GroupByOptimizer, because PartitionPruner does not obtain the PartitionList at the optimize phase; it only obtains partition predicates.



[jira] Commented: (HIVE-931) Sorted Group By

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780463#action_12780463 ] 

Namit Jain commented on HIVE-931:
---------------------------------

1. Given the fact that partition pruning has already happened and is stored in the parse context, can you use that information
instead of calling PartitionPruner.prune() again?
2. Instead of walking up the tree, can you collect the list of the tablescans before that group by?
3. Can you add some more comments in GroupByOptimizer?
4. I am not sure, but there seems to be a bug there.

    What about the case:

    (subq) followed by groupby

    Are you taking the base tables of the subquery, which may be different?

Can you add tests for the above scenario?



[jira] Commented: (HIVE-931) Sorted Group By

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784675#action_12784675 ] 

He Yongqiang commented on HIVE-931:
-----------------------------------

{quote}
Isnt there a bug here ?
group by foo(foo2(x))
You cant assume that the first child is a column - you should recurse till you get a column
{quote}

Thanks, Namit. I don't think I assumed that the first child is a column; I just insert all child exprs at the front of the groupByKeys list and recurse.

Will add a test case for this.



[jira] Commented: (HIVE-931) Sorted Group By

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784660#action_12784660 ] 

Namit Jain commented on HIVE-931:
---------------------------------

    } else if (node instanceof exprNodeGenericFuncDesc) {
      exprNodeGenericFuncDesc udfNode = ((exprNodeGenericFuncDesc) node);
      GenericUDF udf = udfNode.getGenericUDF();
      if (!FunctionRegistry.isDeterministic(udf))
        return;
      groupByKeys.addAll(0, udfNode.getChildExprs());


Isn't there a bug here?

group by foo(foo2(x))


You can't assume that the first child is a column - you should recurse until you get a column
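
To make the concern concrete, here is a small Python sketch of recursing through nested function calls until columns are reached. The `Column` and `Func` classes are hypothetical stand-ins invented for this example, not Hive's expression node classes, and giving up on a non-deterministic function simply mirrors the early return in the fragment above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Column:
    name: str


@dataclass
class Func:
    name: str
    children: List["Expr"] = field(default_factory=list)
    deterministic: bool = True


Expr = Union[Column, Func]


def collect_columns(expr: Expr) -> Optional[List[str]]:
    """Return the columns under expr, or None if any function on the way
    is non-deterministic (in which case the optimization is skipped)."""
    if isinstance(expr, Column):
        return [expr.name]
    if not expr.deterministic:
        return None
    columns: List[str] = []
    for child in expr.children:
        found = collect_columns(child)  # recurse until a column is reached
        if found is None:
            return None
        columns.extend(found)
    return columns


# group by foo(foo2(x))
nested = Func("foo", [Func("foo2", [Column("x")])])
# Same shape, but the inner function is non-deterministic.
unsafe = Func("foo", [Func("rand_like", [Column("x")], deterministic=False)])
```

Here `collect_columns(nested)` reaches `x` through two levels of function calls, while `collect_columns(unsafe)` gives up.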




[jira] Commented: (HIVE-931) Sorted Group By

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785700#action_12785700 ] 

Namit Jain commented on HIVE-931:
---------------------------------

+1

looks good - will commit if the tests pass after https://issues.apache.org/jira/browse/HIVE-549



[jira] Updated: (HIVE-931) Sorted Group By

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-931:
------------------------------

    Attachment: hive-931-2009-11-19.patch

Incorporates Namit's comments. Thanks, Namit!



[jira] Commented: (HIVE-931) Sorted Group By

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784662#action_12784662 ] 

Namit Jain commented on HIVE-931:
---------------------------------

Also can you add a test for the same.





[jira] Commented: (HIVE-931) Sorted Group By

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781099#action_12781099 ] 

He Yongqiang commented on HIVE-931:
-----------------------------------

The mapping added in ParseContext will be updated by GroupByOptimizer, and reused when generating mr tasks.



[jira] Commented: (HIVE-931) Sorted Group By

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785789#action_12785789 ] 

Zheng Shao commented on HIVE-931:
---------------------------------

Is the "sorted by" property always the same as "bucketed by" for a table?

The name "hive.optimize.groupby" is a bit too general but I guess it's OK for now. Can we explain what "bucketed group by" is in the hive-default.xml? Users probably won't understand what it is.




[jira] Commented: (HIVE-931) Sorted Group By

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785796#action_12785796 ] 

He Yongqiang commented on HIVE-931:
-----------------------------------

Hi Zheng,
bq. Is the "sorted by" property always the same as "bucketed by" for a table? 
They are not the same, and usually 'sorted by' is empty.

bq. Can we explain what is "bucketed group by" in the hive-default.xml? Users probably won't understand what it is.
Yes, I should add more explanation for this. I am OK with adding it in a new JIRA, updating the wiki, or both.




[jira] Commented: (HIVE-931) Sorted Group By

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780221#action_12780221 ] 

Namit Jain commented on HIVE-931:
---------------------------------

Discussed with Yongqiang offline - this should be an optimization step instead of the current approach,
since by that time partition pruning has also been performed.



[jira] Commented: (HIVE-931) Sorted Group By

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780459#action_12780459 ] 

Namit Jain commented on HIVE-931:
---------------------------------

I think bucketCols should be the same as groupbyCols, not a superset.

Consider:

select ... group by a,b:

where data is bucketed by a,b,c.

A mapper might get:

a1   b1   c1
a1   b2   c2
a1   b1   c3


in which case the current algorithm might not work.
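
The three rows above can be run through a hash-free, run-based aggregation to see the problem. This is a Python sketch with a hypothetical `streaming_count` helper, not the patch's algorithm. It only shows that the (a, b) key is not contiguous within the mapper's input; whether a later stage would merge the partial results is outside the sketch.

```python
from typing import Iterable, Iterator, Tuple

Key = Tuple[str, ...]


def streaming_count(keys: Iterable[Key]) -> Iterator[Tuple[Key, int]]:
    """Count consecutive runs of equal keys, as a hash-free group by would."""
    previous = None
    count = 0
    for key in keys:
        if count and key != previous:
            # Key changed: emit the finished run and start a new one.
            yield previous, count
            count = 0
        previous = key
        count += 1
    if count:
        yield previous, count


# Rows as one mapper might see them when the data is bucketed by a, b, c.
mapper_rows = [("a1", "b1", "c1"), ("a1", "b2", "c2"), ("a1", "b1", "c3")]

# Group by (a, b) only: the key (a1, b1) shows up in two separate runs.
partial_groups = list(streaming_count((a, b) for a, b, _c in mapper_rows))
```

The result contains `("a1", "b1")` twice, once per run, instead of a single group with count 2.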


Also, change the name of the variable in isTableSortedbyColumns to bucketedCols instead of sortedCols



[jira] Updated: (HIVE-931) Sorted Group By

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-931:
------------------------------

    Attachment: hive-931-2009-11-18.patch

hive-931-2009-11-18.patch adds two tests, one in positive tests, the other in negative tests.



[jira] Updated: (HIVE-931) Sorted Group By

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-931:
------------------------------

    Attachment: hive-931-2009-12-03.patch

Attached a new patch. Had a lot of offline discussions with Namit. Thanks Namit!

Finally, we changed the rule:
we will transform a group by into a sort-based group by when

1) the table's sort columns are empty, and the bucket columns contain exactly the group by columns (order does not matter),

or

2) the table's sort columns are not empty, and the group by columns are a prefix subset of the sort columns.
For example, if sorted by a,b,c, group by 
a,
a,b
b,a
a,b,c
b,a,c ..
are all OK.
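
One way to read the two conditions above as a predicate, sketched in Python. The function name and the list-of-column-names representation are assumptions for this example, and returning False for an empty group by list is a choice made here, not something stated in the thread.

```python
from typing import Sequence


def can_use_sort_based_group_by(
    group_by_cols: Sequence[str],
    bucket_cols: Sequence[str],
    sort_cols: Sequence[str],
) -> bool:
    """Check the two conditions described above (hypothetical helper)."""
    group_set = set(group_by_cols)
    if not group_set:
        # Assumption for this sketch: nothing to match without group by keys.
        return False
    if not sort_cols:
        # Condition 1: bucket columns are exactly the group by columns,
        # in any order.
        return group_set == set(bucket_cols)
    # Condition 2: the group by columns, in any order, are the first
    # len(group_set) sort columns.
    prefix = list(sort_cols[: len(group_set)])
    return len(prefix) == len(group_set) and set(prefix) == group_set


sort_abc = ["a", "b", "c"]
ok_cases = [["a"], ["a", "b"], ["b", "a"], ["a", "b", "c"], ["b", "a", "c"]]
results = [can_use_sort_based_group_by(g, [], sort_abc) for g in ok_cases]
```

All five group by lists from the example are accepted, while something like group by b or group by a,c on the same table is rejected because it skips a leading sort column.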


