You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Rajat Khandelwal <pr...@apache.org> on 2019/04/23 10:39:45 UTC

Filters with IN clause are getting omitted

Hi

I've recently noticed incorrect behaviour from Hive Query Planner. The
simplest example I could construct is as follows

SELECT
tbl3.col2 AS current_regularity_streak
FROM (select 1 col1) tbl1
LEFT JOIN
(select 1 col1) tbl2 ON tbl1.col1 = tbl2.col1
LEFT JOIN (select 1 col1, 1 col2) tbl3 ON tbl1.col1 = tbl3.col1
WHERE
tbl1.col1 in (select 1 col1 union all select 2)
AND
tbl3.col2 >= 2


The query should logically return no rows, but it does! It returns 1 row
with 1 column, with value = 1. The value=1 should have been filtered out by
the filter tbl3.col2 >= 2
  So  df

On further examination, I believe the culprit is the IN clause. If I remove
this, the query works correctly and returns 0 rows.

SELECT
tbl3.col2 AS current_regularity_streak
FROM (select 1 col1) tbl1
LEFT JOIN
(select 1 col1) tbl2 ON tbl1.col1 = tbl2.col1
LEFT JOIN (select 1 col1, 1 col2) tbl3 ON tbl1.col1 = tbl3.col1
WHERE
tbl3.col2 >= 2

Is this a known issue? I couldn't find anything on JIRA/Stack
overflow/Google.

I further analyzed using EXPLAIN FORMATTED and noticed that the plan of the
first query doesn't contain the >=2 predicate. The plan of the second query
does. I wonder how the planner could omit the filter clause altogether?

I'm using HDP 2.6.5.10-2.

Re: Filters with IN clause are getting omitted

Posted by Sungwoo Park <gl...@gmail.com>.
Not  solution to the problem on HDP 2.6.5, but I have tested the first
script in Hive 2.3.4 and Hive 3.1.1. On Hive 2.3.4, it returns 1 row, and
on Hive 3.1.1, it returns no row. So, I guess the bug is still in HDP 2.6.5.

--- Sungwoo

On Tue, Apr 23, 2019 at 7:40 PM Rajat Khandelwal <pr...@apache.org> wrote:

> Hi
>
> I've recently noticed incorrect behaviour from Hive Query Planner. The
> simplest example I could construct is as follows
>
> SELECT
> tbl3.col2 AS current_regularity_streak
> FROM (select 1 col1) tbl1
> LEFT JOIN
> (select 1 col1) tbl2 ON tbl1.col1 = tbl2.col1
> LEFT JOIN (select 1 col1, 1 col2) tbl3 ON tbl1.col1 = tbl3.col1
> WHERE
> tbl1.col1 in (select 1 col1 union all select 2)
> AND
> tbl3.col2 >= 2
>
>
> The query should logically return no rows, but it does! It returns 1 row
> with 1 column, with value = 1. The value=1 should have been filtered out by
> the filter tbl3.col2 >= 2
>   So  df
>
> On further examination, I believe the culprit is the IN clause. If I
> remove this, the query works correctly and returns 0 rows.
>
> SELECT
> tbl3.col2 AS current_regularity_streak
> FROM (select 1 col1) tbl1
> LEFT JOIN
> (select 1 col1) tbl2 ON tbl1.col1 = tbl2.col1
> LEFT JOIN (select 1 col1, 1 col2) tbl3 ON tbl1.col1 = tbl3.col1
> WHERE
> tbl3.col2 >= 2
>
> Is this a known issue? I couldn't find anything on JIRA/Stack
> overflow/Google.
>
> I further analyzed using EXPLAIN FORMATTED and noticed that the plan of
> the first query doesn't contain the >=2 predicate. The plan of the second
> query does. I wonder how the planner could omit the filter clause
> altogether?
>
> I'm using HDP 2.6.5.10-2.
>