You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Gopal V (JIRA)" <ji...@apache.org> on 2015/11/04 22:19:27 UTC
[jira] [Commented] (HIVE-12334) Partition Map Join
[ https://issues.apache.org/jira/browse/HIVE-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990441#comment-14990441 ]
Gopal V commented on HIVE-12334:
--------------------------------
This is an intriguing idea, with the layout partitioning that can be expanded as standard SQL unions?
{code}
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key and a.key
{code}
becomes
{code}
SELECT a.key, a.value from a,b where a.key = b.key and a.key = "20151208" and b.key = "20151208"
UNION ALL
SELECT a.key, a.value from a,b where a.key = b.key and a.key = "20151209" and b.key = "20151309"
...
{code}
for the set intersection of a.key & b.key?
> Partition Map Join
> ------------------
>
> Key: HIVE-12334
> URL: https://issues.apache.org/jira/browse/HIVE-12334
> Project: Hive
> Issue Type: Improvement
> Components: Logical Optimizer, Physical Optimizer, SQL
> Affects Versions: 0.13.0, 0.14.0, 0.13.1, 1.0.0, 1.1.0
> Reporter: Maciek Kocon
> Labels: gsoc2015
>
> Logically and functionally bucketing and partitioning are quite similar - both provide mechanism to segregate and separate the table's data based on its content. Thanks to that significant further optimisations like [partition] PRUNING or [bucket] MAP JOIN are possible.
> The difference seems to be imposed by design where the PARTITIONing is open/explicit while BUCKETing is discrete/implicit.
> Partitioning seems to be very common if not a standard feature in all current RDBMS while BUCKETING seems to be HIVE specific only.
> In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT PARTITIONING".
> Regardless of the fact that these two are recognised as two separate features available in Hive there should be nothing to prevent leveraging same existing query/join optimisations across the two.
> PARTITION MAPJOIN
> Use the same type of optimization as in BUCKETED MAP JOIN for PARTITIONED tables.
> The partition map join could be performed if the tables being joined are partitioned on the join columns.
> If table A has set partitioning on KEY column and table B is partitioned on KEY column, the following join
> SELECT /*+ MAPJOIN(b) */ a.key, a.value
> FROM a JOIN b ON a.key = b.key
> can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required partitions are fetched. For the query above, the mapper processing partition key='20151208' for A will only fetch partition for key='20151208' of B.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)