Posted to user@hive.apache.org by Jan Dolinár <do...@gmail.com> on 2014/01/23 10:27:34 UTC
Multi-group-by with transform leads to incorrect optimization
Hello,
I've encountered an issue with Hive's predicate pushdown optimization when
multi-group-by is used together with TRANSFORM. Here is a simple test case
to illustrate my point:
CREATE TABLE IF NOT EXISTS my_table (
  id INT,
  property1 INT,
  property2 INT,
  count INT
);
EXPLAIN
FROM (
  SELECT TRANSFORM(
    id,
    property1,
    property2,
    count
  ) USING 'cat' AS (
    id INT,
    property1 INT,
    count INT,
    property2 INT
  )
  FROM my_table
) t
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/test1'
  SELECT id, property1, SUM(count)
  GROUP BY id, property1
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/test2'
  SELECT id, property2, SUM(count)
  WHERE property1 != 0
  GROUP BY id, property2;
When hive.optimize.ppd=true, Hive moves the WHERE clause from the second
SELECT all the way down into the transform operator. This is wrong, because
the filter then also removes rows that the first SELECT should see. With
hive.optimize.ppd=false everything works as expected. Without the
TRANSFORM, it works correctly as well.
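Based on the behaviour described above, a possible stopgap is to turn predicate pushdown off before running the query. This is only a minimal sketch, assuming the setting is changed for the current session rather than globally in the site configuration; it works around the problem but does not fix the optimizer:

```sql
-- Workaround sketch: disable predicate pushdown for this session only.
SET hive.optimize.ppd=false;

-- Then re-run the EXPLAIN from the test case and check that the filter on
-- property1 appears only in the branch that writes to /tmp/test2.
```

Note that this disables pushdown for every query in the session, so other queries may lose the benefit of the optimization.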
I see this problem with Hive version 0.10.0 (cdh4.4.0). With Hive 0.7.1 the
same query behaves correctly, regardless of the hive.optimize.ppd setting. So
it seems to be a bug introduced with some PPD improvements in 0.8 or later.
Can anyone confirm whether this is still broken in the newest versions? If it
doesn't work with 0.12, I'll file a new issue in JIRA.
Best regards,
Jan Dolinar