Posted to issues@hive.apache.org by "Pushpender Garg (Jira)" <ji...@apache.org> on 2020/10/06 14:46:00 UTC
[jira] [Updated] (HIVE-24237) Multi level/dimensional bucketing in Hive

     [ https://issues.apache.org/jira/browse/HIVE-24237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pushpender Garg updated HIVE-24237:
-----------------------------------
    Description: 
Hive can considerably optimize the execution of certain queries, such as filters, aggregations, and joins, when bucketed columns are used in those operations. Buckets can also be created on multiple columns, in which case the hash function is computed over all bucket columns combined.

The problem is that if buckets are created on multiple columns but a query uses only a subset of those columns, Hive does not optimize that query. Unless all bucket columns are used as predicates, bucketing is not utilized. The proposed solution addresses this, so that Hive can optimize the query even when only a subset of the bucket columns is used.
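
To see why a predicate on a subset of the bucket columns cannot prune buckets under combined hashing, consider the following minimal Python sketch. It assumes integer columns and a Java-style combination of per-column hashes (h = 31 * h + hash(col)); this hash and the names combined_bucket and candidate_buckets are illustrative assumptions, not Hive's actual code.

```python
def combined_bucket(values, num_buckets):
    """Bucket id from ONE hash computed over all bucket columns together.

    Assumption: integer columns hashed to themselves and combined
    Java-style (31 * h + next); a stand-in, not Hive's real hash.
    """
    h = 0
    for v in values:
        h = (31 * h + int(v)) & 0x7FFFFFFF  # keep the hash non-negative
    return h % num_buckets


def candidate_buckets(a, possible_bs, num_buckets):
    """Buckets that may hold rows with column a == `a` when b is unknown."""
    return {combined_bucket((a, b), num_buckets) for b in possible_bs}


if __name__ == "__main__":
    # With only `a = 5` known, every one of the 8 buckets is a candidate,
    # so nothing can be skipped.
    print(sorted(candidate_buckets(5, range(8), 8)))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because the unknown column feeds into the same hash, the known column alone says nothing about which bucket a row landed in.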

Instead of storing data in single-dimensional buckets, it can be stored in multi-dimensional buckets when multiple columns are given. If a subset of the bucketed columns is used as predicates in a query, the appropriate buckets can be identified from the hash values of the individual columns, and only those buckets are scanned. This enables optimizations even when only one or a few of the bucket columns are used in a query.
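
One possible reading of the multi-dimensional layout can be sketched in Python as follows. It assumes each bucket column gets its own bucket count and is hashed independently, with the per-column coordinates flattened into a single bucket id in row-major order. The names col_hash, bucket_coords, bucket_id and buckets_to_scan, the identity hash for integers, and the 4 x 4 grid are all hypothetical choices for illustration; the issue does not specify an implementation.

```python
from itertools import product


def col_hash(value):
    """Toy per-column hash for integer values (an assumption, not Hive's)."""
    return int(value) & 0x7FFFFFFF


def bucket_coords(row, dims):
    """Per-column bucket coordinates of a row, one coordinate per dimension."""
    return tuple(col_hash(v) % n for v, n in zip(row, dims))


def bucket_id(coords, dims):
    """Flatten grid coordinates into a single bucket id (row-major order)."""
    idx = 0
    for c, n in zip(coords, dims):
        idx = idx * n + c
    return idx


def buckets_to_scan(predicates, dims):
    """Bucket ids to read, given one equality value (or None) per column."""
    axes = []
    for value, n in zip(predicates, dims):
        if value is None:
            axes.append(range(n))               # column unconstrained: whole axis
        else:
            axes.append([col_hash(value) % n])  # column fixed: one coordinate
    return sorted(bucket_id(c, dims) for c in product(*axes))


if __name__ == "__main__":
    dims = (4, 4)                             # 4 x 4 grid, 16 buckets in total
    print(buckets_to_scan([5, None], dims))   # [4, 5, 6, 7]
    print(buckets_to_scan([None, 6], dims))   # [2, 6, 10, 14]
    print(buckets_to_scan([5, 6], dims))      # [6]
```

In this sketch a predicate on either column alone narrows the scan from 16 buckets to 4, and a predicate on both narrows it to 1, which is the pruning behaviour the proposal describes.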

  was:
Hive can considerably optimize the execution of certain queries, such as filters, aggregations, and joins, when bucketed columns are used in those operations. Buckets can also be created on multiple columns, in which case the hash function is computed over all bucket columns combined.

The problem is that if buckets are created on multiple columns but a query uses only a subset of those columns, Hive does not optimize that query. Unless all bucket columns are used as predicates, bucketing is not utilized. The solution proposed in this document addresses this, so that Hive can optimize the query even when only a subset of the bucket columns is used.

Instead of storing data in single-dimensional buckets, it can be stored in multi-dimensional buckets when multiple columns are given. If a subset of the bucketed columns is used as predicates in a query, the appropriate buckets can be identified from the hash values of the individual columns, and only those buckets are scanned. This enables optimizations even when only one or a few of the bucket columns are used in a query.


> Multi level/dimensional bucketing in Hive
> -----------------------------------------
>
>                 Key: HIVE-24237
>                 URL: https://issues.apache.org/jira/browse/HIVE-24237
>             Project: Hive
>          Issue Type: New Feature
>          Components: Database/Schema
>    Affects Versions: 3.1.1, 3.1.2
>            Reporter: Pushpender Garg
>            Priority: Minor
>
> Hive can considerably optimize the execution of certain queries, such as filters, aggregations, and joins, when bucketed columns are used in those operations. Buckets can also be created on multiple columns, in which case the hash function is computed over all bucket columns combined.
> The problem is that if buckets are created on multiple columns but a query uses only a subset of those columns, Hive does not optimize that query. Unless all bucket columns are used as predicates, bucketing is not utilized. The proposed solution addresses this, so that Hive can optimize the query even when only a subset of the bucket columns is used.
> Instead of storing data in single-dimensional buckets, it can be stored in multi-dimensional buckets when multiple columns are given. If a subset of the bucketed columns is used as predicates in a query, the appropriate buckets can be identified from the hash values of the individual columns, and only those buckets are scanned. This enables optimizations even when only one or a few of the bucket columns are used in a query.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)