You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Seonggon Namgung (Jira)" <ji...@apache.org> on 2023/05/18 08:46:00 UTC

[jira] [Created] (HIVE-27357) Map-side SMB Join returns incorrect result when it 2 tables have different bucket size

Seonggon Namgung created HIVE-27357:
---------------------------------------

             Summary: Map-side SMB Join returns incorrect result when it 2 tables have different bucket size
                 Key: HIVE-27357
                 URL: https://issues.apache.org/jira/browse/HIVE-27357
             Project: Hive
          Issue Type: Bug
    Affects Versions: 4.0.0-alpha-2, 3.1.3
            Reporter: Seonggon Namgung
            Assignee: Seonggon Namgung


The following query returns \{(1, 1), (2, 2), (7, 7)} instead of \{(1, 1), (2, 2), (7, 7), (6, 6), (14, 14), (11, 11)}.

 

 
{code:java}
set hive.strict.checks.bucketing=true;
set hive.auto.convert.join=true;
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.join.noconditionaltask.size=1;
set hive.optimize.dynamic.partition.hashjoin=false;

DROP TABLE IF EXISTS bucket2;
CREATE TABLE bucket2(key string, value string) CLUSTERED BY (key) SORTED BY (key) INTO 2 BUCKETS;
DROP TABLE IF EXISTS bucket3;
CREATE TABLE bucket3(key string, value string) CLUSTERED BY (key) SORTED BY (key) INTO 3 BUCKETS;

INSERT INTO TABLE bucket2 VALUES (1, 1), (2, 2), (7, 7), (6, 6), (14, 14), (11, 11);
INSERT INTO TABLE bucket3 VALUES (1, 1), (2, 2), (7, 7), (6, 6), (14, 14), (11, 11);

SELECT * FROM bucket2 JOIN bucket3 on bucket2.key = bucket3.key; {code}
 

 

It is known that sort-merge join is used when two tables have the same number of buckets, but I could not find such restriction from the source code. Also, current Hive uses map side SMB join for the above query, which joins 2 buckets table and 3 buckets table. So I'm planning to fix this issue not by using another Join algorithms.

Originally, we found this issue by running auto_sortmerge_join_12.q with hive.strict.checks.bucketing=true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)