Posted to dev@hive.apache.org by "Yin Huai (JIRA)" <ji...@apache.org> on 2013/06/22 02:05:20 UTC

[jira] [Created] (HIVE-4781) LEFT SEMI JOIN generates wrong results when

Yin Huai created HIVE-4781:
------------------------------

             Summary: LEFT SEMI JOIN generates wrong results when 
                 Key: HIVE-4781
                 URL: https://issues.apache.org/jira/browse/HIVE-4781
             Project: Hive
          Issue Type: Bug
    Affects Versions: 0.12.0
            Reporter: Yin Huai
            Assignee: Yin Huai


Suppose we have the query shown below:
{code:sql}
SELECT key FROM t1 LEFT SEMI JOIN t2 ON (t1.key=t2.key);
{code}

When t2 has more rows than hive.join.emit.interval, JoinOperator will emit rows from t1 more than once, which results in redundant output.

Let's say t1 is
{code}
key
----
1
{code}
and t2 is
{code}
key
----
1
1
1
1
{code}

When hive.join.emit.interval=1, the output of the above query is
{code}
1
1
1
1
{code}
The correct result should be 
{code}
1
{code}
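
For reference, the expected semantics can be sketched in a few lines of Python. This is only a toy model of the join semantics on the sample data above, not Hive's JoinOperator code; the function names left_semi_join and inner_join are made up for illustration. A left semi join should return each t1 row at most once if any match exists in t2, whereas an inner join returns one row per matching pair, which is the same shape as the wrong output shown above.

```python
def left_semi_join(left_keys, right_keys):
    """Expected semantics: keep each left row once if its key appears in right."""
    right = set(right_keys)
    return [k for k in left_keys if k in right]


def inner_join(left_keys, right_keys):
    """Inner join on key: one output row per matching (left, right) pair."""
    return [l for l in left_keys for r in right_keys if l == r]


# Sample data from this report (hypothetical in-memory stand-ins for t1 and t2).
t1 = [1]
t2 = [1, 1, 1, 1]

print(left_semi_join(t1, t2))  # one row, the correct result
print(inner_join(t1, t2))      # four rows, same shape as the wrong result
```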

This problem is not caught by the unit tests: because a GBY operator is inserted before JoinOperator and there is only 1 mapper, the output of the map phase contains only distinct keys.

Please apply the patch 'wrong_semi_join.txt' attached below and run
{code}
ant test -Dtestcase=TestMinimrCliDriver -Dqfile="left_semi_join.q" -Dtest.silent=false
{code}
to reproduce the problem. The wrong result can be found in
{code}
<hive_root_dir>/build/ql/test/logs/clientpositive
{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira