You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Aditya Allamraju (JIRA)" <ji...@apache.org> on 2017/08/18 10:15:00 UTC
[jira] [Updated] (HIVE-17355) Casting to Decimal along with UNION ALL gives incosistent results

     [ https://issues.apache.org/jira/browse/HIVE-17355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aditya Allamraju updated HIVE-17355:
------------------------------------
    Description: 
Extra trailing zeros are added when running "union all" on the tables containing decimal data types.

Version: Hive 2.1

Steps to repro:-
================

1) CREATE TABLE `decisample`(
  `a` decimal(8,2),
  `b` int,
  `c` decimal(5,2))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'maprfs:/user/hive/warehouse/decisample'

2) CREATE TABLE `decisample3`(
  `a` decimal(8,2),
  `b` int,
  `c` decimal(5,2))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'maprfs:/user/hive/warehouse/decisample3'

3)hive> select * from decisample3;
OK
1.00    2       3.00
7.00    8       9.00

4)hive> select * from decisample;
OK
4.00    5       6.00

5) query:- 
select a1.a, '' as a1b,'' as a1c from decisample a1 union all select a2.a,a2.b,a2.c from decisample3 a2;

o/p:-
OK
4.00            NULL
1.00    2       3.000000000000000000
7.00    8       9.000000000000000000
Time taken: 87.993 seconds, Fetched: 3 row(s)

6)select a2.a,a2.b,a2.c from decisample3 a2 union all select a1.a, '' as a1b,'' as a1c from decisample a1;

o/p:-
4.00
1.00    2       3
7.00    8       9

Steps 5 is yielding 18 trailing zeros where as step 6 query is yieldings no trailing  zero.

Observation:

1. Hive is trying to run the UNION ALL after ensuring the SELECT's are semantically same(equal number of columns and same datatypes). To do this, it is implicitly type casting the values where required.

From the explain plan, type casting is not consistent when done 2 different ways:
a)  select-1  UNION ALL select-2 (Query-5 in above comment)
vs
b) select-2 UNION ALL select-2   (Query-6 in above comment)

Showing only the "expresssions" part of execution plans

Query-5:
========
..
..
Map Operator Tree:
          TableScan
            alias: a1
            Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), '' (type: string), null (type: decimal(38,18))
              outputColumnNames: _col0, _col1, _col2
..
..
TableScan
            alias: a2
            Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), UDFToString(b) (type: string), CAST( c AS decimal(38,18)) (type: decimal(38,18))


Query-6:
========
..
..
Map Operator Tree:
          TableScan
            alias: a2
            Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), UDFToString(b) (type: string), UDFToString(c) (type: string)
..
..
TableScan
            alias: a1
            Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), '' (type: string), '' (type: string)
..
..

Attaching the execution plans for both queries for reference.

2. The reason for 18 zeros in query-5 above is due to casting NULL to Decimal.
And by default, the precision and scale are taken as (38,18) in Hive. This could be the reason for 18 zeros.

3. This is repeating every time implicit type casting is happening on EMPTY strings.

If excluding few columns in one of the SELECT statement is absolutely necessary, then the only Workaround is to explicitly type cast the empty strings to same Datatypes as the Other Select statement which included the columns.


For ex:
Q1:

select a,b,c from decisample3
union all
select a,cast(' ' as int),cast(' ' as decimal) from decisample;

Q2:

select a,cast(' ' as int),cast(' ' as decimal) from decisample
union all
select a,b,c from decisample3;

Both the above queries will give consistent result now.

cast(' ' as int) ---> this was cast to INT, same as datatype of "b"
cast(' ' as decimal) ---> this was cast to decimal, same as datatype of "c"

  was:
Extra trailing zeros are added when running "union all" on the tables containing decimal data types.

Env:-
=====
mapr-hivemetastore-2.1.201703241741-1.noarch
mapr-hive-2.1.201703241741-1.noarch
mapr-hiveserver2-2.1.201703241741-1.noarch

Steps to repro:-
================

1) CREATE TABLE `decisample`(
  `a` decimal(8,2),
  `b` int,
  `c` decimal(5,2))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'maprfs:/user/hive/warehouse/decisample'

2) CREATE TABLE `decisample3`(
  `a` decimal(8,2),
  `b` int,
  `c` decimal(5,2))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'maprfs:/user/hive/warehouse/decisample3'

3)hive> select * from decisample3;
OK
1.00    2       3.00
7.00    8       9.00

4)hive> select * from decisample;
OK
4.00    5       6.00

5) query:- 
select a1.a, '' as a1b,'' as a1c from decisample a1 union all select a2.a,a2.b,a2.c from decisample3 a2;

o/p:-
OK
4.00            NULL
1.00    2       3.000000000000000000
7.00    8       9.000000000000000000
Time taken: 87.993 seconds, Fetched: 3 row(s)

6)select a2.a,a2.b,a2.c from decisample3 a2 union all select a1.a, '' as a1b,'' as a1c from decisample a1;

o/p:-
4.00
1.00    2       3
7.00    8       9

Steps 5 is yielding 18 trailing zeros where as step 6 query is yieldings no trailing  zero.

Observation:

1. Hive is trying to run the UNION ALL after ensuring the SELECT's are semantically same(equal number of columns and same datatypes). To do this, it is implicitly type casting the values where required.

From the explain plan, type casting is not consistent when done 2 different ways:
a)  select-1  UNION ALL select-2 (Query-5 in above comment)
vs
b) select-2 UNION ALL select-2   (Query-6 in above comment)

Showing only the "expresssions" part of execution plans

Query-5:
========
..
..
Map Operator Tree:
          TableScan
            alias: a1
            Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), '' (type: string), null (type: decimal(38,18))
              outputColumnNames: _col0, _col1, _col2
..
..
TableScan
            alias: a2
            Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), UDFToString(b) (type: string), CAST( c AS decimal(38,18)) (type: decimal(38,18))


Query-6:
========
..
..
Map Operator Tree:
          TableScan
            alias: a2
            Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), UDFToString(b) (type: string), UDFToString(c) (type: string)
..
..
TableScan
            alias: a1
            Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), '' (type: string), '' (type: string)
..
..

Attaching the execution plans for both queries for reference.

2. The reason for 18 zeros in query-5 above is due to casting NULL to Decimal.
And by default, the precision and scale are taken as (38,18) in Hive. This could be the reason for 18 zeros.

3. This is repeating every time implicit type casting is happening on EMPTY strings.

If excluding few columns in one of the SELECT statement is absolutely necessary, then the only Workaround is to explicitly type cast the empty strings to same Datatypes as the Other Select statement which included the columns.


For ex:
Q1:

select a,b,c from decisample3
union all
select a,cast(' ' as int),cast(' ' as decimal) from decisample;

Q2:

select a,cast(' ' as int),cast(' ' as decimal) from decisample
union all
select a,b,c from decisample3;

Both the above queries will give consistent result now.

cast(' ' as int) ---> this was cast to INT, same as datatype of "b"
cast(' ' as decimal) ---> this was cast to decimal, same as datatype of "c"


> Casting to Decimal along with UNION ALL gives incosistent results
> -----------------------------------------------------------------
>
>                 Key: HIVE-17355
>                 URL: https://issues.apache.org/jira/browse/HIVE-17355
>             Project: Hive
>          Issue Type: Bug
>          Components: Parser, UDF
>    Affects Versions: 2.1.0, 2.1.1
>         Environment: CentOS 7.2
>            Reporter: Aditya Allamraju
>
> Extra trailing zeros are added when running "union all" on the tables containing decimal data types.
> Version: Hive 2.1
> Steps to repro:-
> ================
> 1) CREATE TABLE `decisample`(
>   `a` decimal(8,2),
>   `b` int,
>   `c` decimal(5,2))
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'maprfs:/user/hive/warehouse/decisample'
> 2) CREATE TABLE `decisample3`(
>   `a` decimal(8,2),
>   `b` int,
>   `c` decimal(5,2))
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'maprfs:/user/hive/warehouse/decisample3'
> 3)hive> select * from decisample3;
> OK
> 1.00    2       3.00
> 7.00    8       9.00
> 4)hive> select * from decisample;
> OK
> 4.00    5       6.00
> 5) query:- 
> select a1.a, '' as a1b,'' as a1c from decisample a1 union all select a2.a,a2.b,a2.c from decisample3 a2;
> o/p:-
> OK
> 4.00            NULL
> 1.00    2       3.000000000000000000
> 7.00    8       9.000000000000000000
> Time taken: 87.993 seconds, Fetched: 3 row(s)
> 6)select a2.a,a2.b,a2.c from decisample3 a2 union all select a1.a, '' as a1b,'' as a1c from decisample a1;
> o/p:-
> 4.00
> 1.00    2       3
> 7.00    8       9
> Steps 5 is yielding 18 trailing zeros where as step 6 query is yieldings no trailing  zero.
> Observation:
> 1. Hive is trying to run the UNION ALL after ensuring the SELECT's are semantically same(equal number of columns and same datatypes). To do this, it is implicitly type casting the values where required.
> From the explain plan, type casting is not consistent when done 2 different ways:
> a)  select-1  UNION ALL select-2 (Query-5 in above comment)
> vs
> b) select-2 UNION ALL select-2   (Query-6 in above comment)
> Showing only the "expresssions" part of execution plans
> Query-5:
> ========
> ..
> ..
> Map Operator Tree:
>           TableScan
>             alias: a1
>             Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column stats: NONE
>             Select Operator
>               expressions: a (type: decimal(8,2)), '' (type: string), null (type: decimal(38,18))
>               outputColumnNames: _col0, _col1, _col2
> ..
> ..
> TableScan
>             alias: a2
>             Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column stats: NONE
>             Select Operator
>               expressions: a (type: decimal(8,2)), UDFToString(b) (type: string), CAST( c AS decimal(38,18)) (type: decimal(38,18))
> Query-6:
> ========
> ..
> ..
> Map Operator Tree:
>           TableScan
>             alias: a2
>             Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column stats: NONE
>             Select Operator
>               expressions: a (type: decimal(8,2)), UDFToString(b) (type: string), UDFToString(c) (type: string)
> ..
> ..
> TableScan
>             alias: a1
>             Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column stats: NONE
>             Select Operator
>               expressions: a (type: decimal(8,2)), '' (type: string), '' (type: string)
> ..
> ..
> Attaching the execution plans for both queries for reference.
> 2. The reason for 18 zeros in query-5 above is due to casting NULL to Decimal.
> And by default, the precision and scale are taken as (38,18) in Hive. This could be the reason for 18 zeros.
> 3. This is repeating every time implicit type casting is happening on EMPTY strings.
> If excluding few columns in one of the SELECT statement is absolutely necessary, then the only Workaround is to explicitly type cast the empty strings to same Datatypes as the Other Select statement which included the columns.
> For ex:
> Q1:
> select a,b,c from decisample3
> union all
> select a,cast(' ' as int),cast(' ' as decimal) from decisample;
> Q2:
> select a,cast(' ' as int),cast(' ' as decimal) from decisample
> union all
> select a,b,c from decisample3;
> Both the above queries will give consistent result now.
> cast(' ' as int) ---> this was cast to INT, same as datatype of "b"
> cast(' ' as decimal) ---> this was cast to decimal, same as datatype of "c"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)