Posted to user@hive.apache.org by Jan Dolinár <do...@gmail.com> on 2014/01/23 10:27:34 UTC

Multi-group-by with transform leads to incorrect optimization

Hello,

I've encountered an issue with Hive's predicate push down (PPD) optimization
when multi-group-by is used together with TRANSFORM. Here is a simple test
case to illustrate my point:

CREATE TABLE IF NOT EXISTS my_table (
        id INT,
        property1 INT,
        property2 INT,
        count INT
);

EXPLAIN
FROM (
    SELECT TRANSFORM(
        id,
        property1,
        property2,
        count
    ) USING 'cat' AS (
        id INT,
        property1 INT,
        count INT,
        property2 INT
    )
    FROM my_table
) t
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/test1'
SELECT id, property1, SUM(count)
GROUP BY id, property1
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/test2'
SELECT id, property2, SUM(count)
WHERE property1 != 0
GROUP BY id, property2;

When hive.optimize.ppd=true, Hive moves the WHERE clause from the second
SELECT all the way down into the TRANSFORM operator. That is wrong, because
the filter then affects the first SELECT as well. With
hive.optimize.ppd=false everything works as expected. Without the
TRANSFORM, the query works correctly as well.
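
For anyone hitting the same thing, the workaround amounts to switching the
optimization off before running the query. This is only a minimal sketch: it
assumes the setting is changed per session from the Hive CLI, and it disables
predicate push down for every later query in that session, not just this one.

-- Workaround sketch: turn predicate push down off for this session.
SET hive.optimize.ppd=false;

-- Print the current value to check that the setting took effect.
SET hive.optimize.ppd;

-- Then re-run the FROM ... INSERT ... INSERT query above unchanged.

Running the EXPLAIN above once with each value and comparing the two plans
should show where the filter on property1 ends up.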

I see this problem with Hive 0.10.0 (cdh4.4.0). With Hive 0.7.1 the same
query behaves correctly, regardless of the hive.optimize.ppd setting. So it
seems to be a bug introduced with some PPD improvements in 0.8 or later.
Can anyone confirm whether this is still broken in the newest versions? If
it doesn't work with 0.12, I'll file a new issue in JIRA.

Best regards,
Jan Dolinar