You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Laurent Martin (JIRA)" <ji...@apache.org> on 2016/05/20 14:00:19 UTC

[jira] [Updated] (HIVE-13802) Built-in aggregate functions may produce incorrect values when all values being aggregated in a group are NULL as the result of an expression

     [ https://issues.apache.org/jira/browse/HIVE-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laurent Martin updated HIVE-13802:
----------------------------------
    Description: 
With the Tez engine and Hive tables stored as ORC, built-in aggregate functions may produce incorrect values when all values being aggregated in a group are NULL as the result of an expression.

{code}
-- Test 1
-- The S column is populated as NaN

SET hive.execution.engine=tez; 
CREATE TABLE LM
(
    D STRING,
    X DOUBLE
)
STORED AS ORC;

INSERT INTO TABLE LM
VALUES
('2016-05-11',NULL),
('2016-05-11',NULL),
('2016-05-11',NULL),
('2016-05-12',NULL),
('2016-05-12',NULL),
('2016-05-12',NULL);

SELECT D, MIN(X + 3) AS S
FROM LM
GROUP BY D;

-- Test 2
-- The S column will be populated as 1 (dangerous case!)

SET hive.execution.engine=tez; 
CREATE TABLE LM
(
    D STRING,
    X INT
)
STORED AS ORC;

INSERT INTO TABLE LM
VALUES
('2016-05-11',NULL),
('2016-05-11',NULL),
('2016-05-11',NULL),
('2016-05-12',NULL),
('2016-05-12',NULL),
('2016-05-12',NULL);

SELECT D, MIN(X + 3) AS S
FROM LM
GROUP BY D;

-- Workaound:
-- According to my tests, a workaround is to surround the nullable expression with
-- COALESCE. Example:

CREATE TABLE LM
(
    D STRING,
    X INT
)
STORED AS ORC;

INSERT INTO TABLE LM
VALUES
('2016-05-11',NULL),
('2016-05-11',NULL),
('2016-05-11',NULL),
('2016-05-12',NULL),
('2016-05-12',NULL),
('2016-05-12',NULL);

SELECT D, MIN(COALESCE(X + 3)) AS S
FROM LM
GROUP BY D;
{code}


  was:
With the Tez engine and Hive tables stored as ORC, built-in aggregate functions may produce incorrect values when all values being aggregated in a group are NULL as the result of an expression.

-- Test 1
-- The S column is populated as NaN

SET hive.execution.engine=tez; 
CREATE TABLE LM
(
    D STRING,
    X DOUBLE
)
STORED AS ORC;

INSERT INTO TABLE LM
VALUES
('2016-05-11',NULL),
('2016-05-11',NULL),
('2016-05-11',NULL),
('2016-05-12',NULL),
('2016-05-12',NULL),
('2016-05-12',NULL);

SELECT D, MIN(X + 3) AS S
FROM LM
GROUP BY D;

-- Test 2
-- The S column will be populated as 1 (dangerous case!)

SET hive.execution.engine=tez; 
CREATE TABLE LM
(
    D STRING,
    X INT
)
STORED AS ORC;

INSERT INTO TABLE LM
VALUES
('2016-05-11',NULL),
('2016-05-11',NULL),
('2016-05-11',NULL),
('2016-05-12',NULL),
('2016-05-12',NULL),
('2016-05-12',NULL);

SELECT D, MIN(X + 3) AS S
FROM LM
GROUP BY D;

-- Workaound:
-- According to my tests, a workaround is to surround the nullable expression with
-- COALESCE. Example:

CREATE TABLE LM
(
    D STRING,
    X INT
)
STORED AS ORC;

INSERT INTO TABLE LM
VALUES
('2016-05-11',NULL),
('2016-05-11',NULL),
('2016-05-11',NULL),
('2016-05-12',NULL),
('2016-05-12',NULL),
('2016-05-12',NULL);

SELECT D, MIN(COALESCE(X + 3)) AS S
FROM LM
GROUP BY D;



> Built-in aggregate functions may produce incorrect values when all values being aggregated in a group are NULL as the result of an expression
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-13802
>                 URL: https://issues.apache.org/jira/browse/HIVE-13802
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, ORC, Tez
>    Affects Versions: 0.14.0
>            Reporter: Laurent Martin
>
> With the Tez engine and Hive tables stored as ORC, built-in aggregate functions may produce incorrect values when all values being aggregated in a group are NULL as the result of an expression.
> {code}
> -- Test 1
> -- The S column is populated as NaN
> SET hive.execution.engine=tez; 
> CREATE TABLE LM
> (
>     D STRING,
>     X DOUBLE
> )
> STORED AS ORC;
> INSERT INTO TABLE LM
> VALUES
> ('2016-05-11',NULL),
> ('2016-05-11',NULL),
> ('2016-05-11',NULL),
> ('2016-05-12',NULL),
> ('2016-05-12',NULL),
> ('2016-05-12',NULL);
> SELECT D, MIN(X + 3) AS S
> FROM LM
> GROUP BY D;
> -- Test 2
> -- The S column will be populated as 1 (dangerous case!)
> SET hive.execution.engine=tez; 
> CREATE TABLE LM
> (
>     D STRING,
>     X INT
> )
> STORED AS ORC;
> INSERT INTO TABLE LM
> VALUES
> ('2016-05-11',NULL),
> ('2016-05-11',NULL),
> ('2016-05-11',NULL),
> ('2016-05-12',NULL),
> ('2016-05-12',NULL),
> ('2016-05-12',NULL);
> SELECT D, MIN(X + 3) AS S
> FROM LM
> GROUP BY D;
> -- Workaound:
> -- According to my tests, a workaround is to surround the nullable expression with
> -- COALESCE. Example:
> CREATE TABLE LM
> (
>     D STRING,
>     X INT
> )
> STORED AS ORC;
> INSERT INTO TABLE LM
> VALUES
> ('2016-05-11',NULL),
> ('2016-05-11',NULL),
> ('2016-05-11',NULL),
> ('2016-05-12',NULL),
> ('2016-05-12',NULL),
> ('2016-05-12',NULL);
> SELECT D, MIN(COALESCE(X + 3)) AS S
> FROM LM
> GROUP BY D;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)