Posted to issues@hive.apache.org by "Luis Marmolejo (JIRA)" <ji...@apache.org> on 2016/03/29 20:43:25 UTC

[jira] [Commented] (HIVE-11525) Bucket pruning

    [ https://issues.apache.org/jira/browse/HIVE-11525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216604#comment-15216604 ] 

Luis Marmolejo commented on HIVE-11525:
---------------------------------------

Does this fix apply only to ORC files on Tez?
The implementation relies on extracting the bucket number by parsing the HDFS bucket file (i.e. it requires an HDFS directory read), correct?
Does this mean that a bucket file is read from HDFS only if that bucket is not pruned?



> Bucket pruning
> --------------
>
>                 Key: HIVE-11525
>                 URL: https://issues.apache.org/jira/browse/HIVE-11525
>             Project: Hive
>          Issue Type: Improvement
>          Components: Logical Optimizer
>    Affects Versions: 0.13.0, 0.14.0, 0.13.1, 1.0.0, 1.2.0, 1.1.0, 1.3.0, 2.0.0
>            Reporter: Maciek Kocon
>            Assignee: Gopal V
>              Labels: TODOC2.0
>             Fix For: 2.0.0
>
>         Attachments: HIVE-11525.1.patch, HIVE-11525.2.patch, HIVE-11525.3.patch, HIVE-11525.WIP.patch
>
>
> Logically and functionally, bucketing and partitioning are quite similar: both provide a mechanism to segregate and separate a table's data based on its content. Thanks to that, significant further optimisations like [partition] PRUNING or [bucket] MAP JOIN are possible.
> The difference seems to be imposed by design, where PARTITIONing is open/explicit while BUCKETing is discrete/implicit.
> Partitioning seems to be very common, if not a standard feature, in all current RDBMSs, while BUCKETING seems to be specific to Hive.
> In a way, BUCKETING could also be called "hashing" or simply "IMPLICIT PARTITIONING".
> Although these two are recognised as separate features in Hive, nothing should prevent leveraging the same existing query/join optimisations across both.
> BUCKET pruning
> Enable an optimisation equivalent to partition PRUNING for queries on BUCKETED tables.
> The simplest example is a query like:
> "SELECT … FROM x WHERE colA=123123"
> which should read only the relevant bucket file rather than all the bucket files that belong to the table.
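
To make the idea in the description concrete, here is a minimal sketch of pruning bucket files for an equality predicate such as colA=123123. It rests on several assumptions that are not stated in this thread: that table x is bucketed on the integer column colA into 32 buckets (a hypothetical count), that an integer value hashes to itself and is mapped to a bucket by masking the sign bit and taking the modulus, and that bucket files are named like 000019_0. The function names are made up for illustration and are not taken from the patch.

```python
INT_MAX = 0x7FFFFFFF  # same value as Java's Integer.MAX_VALUE


def bucket_for_int(value: int, num_buckets: int) -> int:
    """Assumed rule: an int hashes to itself; mask the sign bit, then take the modulus."""
    return (value & INT_MAX) % num_buckets


def bucket_of_file(file_name: str) -> int:
    """Assumed naming: the digits before the first '_' are the bucket number."""
    return int(file_name.split("_", 1)[0])


def prune_buckets(file_names, value, num_buckets):
    """Keep only the files whose bucket number matches the predicate value."""
    wanted = bucket_for_int(value, num_buckets)
    return [name for name in file_names if bucket_of_file(name) == wanted]


# Hypothetical directory listing for a table with 32 buckets.
files = [f"{i:06d}_0" for i in range(32)]
selected = prune_buckets(files, 123123, 32)
print(selected)  # ['000019_0']
```

Under these assumptions only one of the 32 files needs to be opened; the other 31 are skipped using the file names alone.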



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)