Posted to dev@hive.apache.org by "Xuefu Zhang (JIRA)" <ji...@apache.org> on 2013/12/20 01:25:09 UTC
[jira] [Updated] (HIVE-6057) Enable bucketed sorted merge joins of
arbitrary subqueries
[ https://issues.apache.org/jira/browse/HIVE-6057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xuefu Zhang updated HIVE-6057:
------------------------------
Description:
Currently, you cannot use bucketed SMJ when joining subquery results. It would make sense to be able to explicitly specify bucketing of the intermediate output from a subquery to enable bucketed SMJ.
For example, the following query will NOT use bucketed SMJ:
(gameends and dummymapping are clustered and sorted by hashid into 128 buckets)
{code}
select * from (select hashid,count(*) as c from gameends group by hashid distribute by hashid sort by hashid) e join dummymapping m on e.hashid=m.hashid
{code}
Suggestion: Implement an INTO n BUCKETS syntax for subqueries to enable bucketed SMJ:
{code}
select * from (select hashid,count(*) as c from gameends group by hashid distribute by hashid sort by hashid INTO 128 BUCKETS) e join dummymapping m on e.hashid=m.hashid
{code}
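Until such syntax exists, one possible workaround is to materialize the subquery into a bucketed, sorted table and join against that. The following is only a sketch under stated assumptions: the staging table name gameends_agg and the BIGINT column types are hypothetical (the issue does not give the schema), and whether the planner actually chooses a sort-merge join should be confirmed with EXPLAIN.
{code}
-- Hypothetical staging table; name and column types are assumptions.
CREATE TABLE gameends_agg (hashid BIGINT, c BIGINT)
CLUSTERED BY (hashid) SORTED BY (hashid) INTO 128 BUCKETS;

-- Make the insert honor the table's declared bucketing and sort order.
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

INSERT OVERWRITE TABLE gameends_agg
SELECT hashid, count(*) AS c FROM gameends GROUP BY hashid;

-- Let the planner consider a sort-merge bucket join for bucketed, sorted inputs.
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.auto.convert.sortmerge.join=true;
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

SELECT * FROM gameends_agg e JOIN dummymapping m ON e.hashid = m.hashid;
{code}
The extra write of the intermediate table is the cost this issue proposes to avoid by declaring the bucketing inline in the subquery.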
was:
Currently, you cannot use bucketed SMJ when joining subquery results. It would make sense to be able to explicitly specify bucketing of the intermediate output from a subquery to enable bucketed SMJ.
For example, the following query will NOT use bucketed SMJ:
(gameends and dummymapping are clustered and sorted by hashid into 128 buckets)
select * from (select hashid,count(*) as c from gameends group by hashid distribute by hashid sort by hashid) e join dummymapping m on e.hashid=m.hashid
Suggestion: Implement an INTO n BUCKETS syntax for subqueries to enable bucketed SMJ:
select * from (select hashid,count(*) as c from gameends group by hashid distribute by hashid sort by hashid INTO 128 BUCKETS) e join dummymapping m on e.hashid=m.hashid
> Enable bucketed sorted merge joins of arbitrary subqueries
> ----------------------------------------------------------
>
> Key: HIVE-6057
> URL: https://issues.apache.org/jira/browse/HIVE-6057
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Affects Versions: 0.12.0
> Reporter: Jan-Erik Hedbom
> Priority: Minor
>
> Currently, you cannot use bucketed SMJ when joining subquery results. It would make sense to be able to explicitly specify bucketing of the intermediate output from a subquery to enable bucketed SMJ.
> For example, the following query will NOT use bucketed SMJ:
> (gameends and dummymapping are clustered and sorted by hashid into 128 buckets)
> {code}
> select * from (select hashid,count(*) as c from gameends group by hashid distribute by hashid sort by hashid) e join dummymapping m on e.hashid=m.hashid
> {code}
> Suggestion: Implement an INTO n BUCKETS syntax for subqueries to enable bucketed SMJ:
> {code}
> select * from (select hashid,count(*) as c from gameends group by hashid distribute by hashid sort by hashid INTO 128 BUCKETS) e join dummymapping m on e.hashid=m.hashid
> {code}
--
This message was sent by Atlassian JIRA
(v6.1.4#6159)