You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by "Vikram Dixit K (JIRA)" <ji...@apache.org> on 2015/02/13 20:45:12 UTC

[jira] [Updated] (HIVE-9523) when columns on which tables are partitioned are used in the join condition same join optimizations as for bucketed tables should be applied

     [ https://issues.apache.org/jira/browse/HIVE-9523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vikram Dixit K updated HIVE-9523:
---------------------------------
    Labels: gsoc2015  (was: )

> when columns on which tables are partitioned are used in the join condition same join optimizations as for bucketed tables should be applied
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-9523
>                 URL: https://issues.apache.org/jira/browse/HIVE-9523
>             Project: Hive
>          Issue Type: Improvement
>          Components: Logical Optimizer, Physical Optimizer, SQL
>    Affects Versions: 0.13.0, 0.14.0, 0.13.1
>            Reporter: Maciek Kocon
>              Labels: gsoc2015
>
> For JOIN conditions where partitioning criteria are used respectively:
>             ⋮ 
> FROM TabA JOIN TabB
>    ON TabA.partCol1 = TabB.partCol2
>    AND TabA.partCol2 = TabB.partCol2
> the optimizer could/should choose to treat it the same way as with bucketed tables: ⋮ 
> FROM TabC
>   JOIN TabD
>      ON TabC.clusteredByCol1 = TabD.clusteredByCol2
>    AND TabC.clusteredByCol2 = TabD.clusteredByCol2
> and use either Bucket Map Join or better, the Sort Merge Bucket Map Join.
> This is based on fact that same way as buckets translate to separate files, the partitions essentially provide the same mapping.
> When data locality is known the optimizer could focus only on joining corresponding partitions rather than whole data sets.
> #side notes:
> ⦿ Currently Table DDL Syntax where Partitioning and Bucketing defined at the same time is allowed:
> CREATE TABLE
>  ⋮
> PARTITIONED BY(…) CLUSTERED BY(…) INTO … BUCKETS;
> But in this case optimizer never chooses to use Bucket Map Join or Sort Merge Bucket Map Join which defeats the purpose of creating BUCKETed tables in such scenarios. Should that be raised as a separate BUG?
> ⦿ Currently partitioning and bucketing are two separate things but serve same purpose - shouldn't the concept be merged (explicit/implicit partitions?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)