Posted to issues@hive.apache.org by "Suprith (Jira)" <ji...@apache.org> on 2021/03/19 20:56:00 UTC
[jira] [Updated] (HIVE-24915) Distribute by with sort by clause when used with constant parameter for sort produces wrong result.
[ https://issues.apache.org/jira/browse/HIVE-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suprith updated HIVE-24915:
---------------------------
Description:
A query that combines DISTRIBUTE BY with SORT BY produces wrong results when one of the sort columns is a constant.
Example:
{code:sql}
SELECT
  t.time,
  'a' AS const
FROM
  (SELECT 1591819264 AS time
   UNION ALL
   SELECT 1591819265 AS time) t
DISTRIBUTE BY const
SORT BY const, t.time
{code}
This produces:
||time||const||
|NULL|a|
|NULL|a|

Instead, it should produce the following (Hive 0.13 returns this result):
||time||const||
|1591819264|a|
|1591819265|a|
The ReduceSink operator is created with incorrect sort columns here: [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L9066]
When the constant propagation optimizer is enabled, this leads to incorrect constant folding of operators, which in turn produces the incorrect results.
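For reference, the expected semantics of the first query can be modeled outside Hive. The sketch below is a simplified Python model, not Hive's implementation: the helper name `distribute_and_sort`, the CRC32-based partitioning, and the `num_reducers` parameter are assumptions made for illustration. It only shows that sorting by a constant plus `time` should leave the `time` values intact.

```python
import zlib
from collections import defaultdict


def distribute_and_sort(rows, distribute_keys, sort_keys, num_reducers=2):
    """Model DISTRIBUTE BY + SORT BY.

    Rows are partitioned by a hash of the distribute keys (hypothetical
    CRC32 hash, not Hive's), then sorted within each partition only.
    """
    partitions = defaultdict(list)
    for row in rows:
        key = repr(tuple(row[k] for k in distribute_keys)).encode()
        partitions[zlib.crc32(key) % num_reducers].append(row)

    output = []
    for reducer in sorted(partitions):
        # SORT BY orders rows per reducer, not globally.
        output.extend(
            sorted(partitions[reducer], key=lambda r: tuple(r[k] for k in sort_keys))
        )
    return output


# Rows of the first example query, deliberately given out of order.
rows = [
    {"time": 1591819265, "const": "a"},
    {"time": 1591819264, "const": "a"},
]
result = distribute_and_sort(rows, ["const"], ["const", "time"])
for row in result:
    print(row["time"], row["const"])
```

Because both rows share the same constant, they land in the same partition and come out ordered by `time`, with no column replaced by NULL.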
Another example of the incorrect behavior:
{code:sql}
SELECT
  t.time,
  'a' AS const,
  t.id
FROM
  (SELECT 1591819264 AS time, 1 AS id
   UNION ALL
   SELECT 1591819265 AS time, 2 AS id) t
DISTRIBUTE BY t.time
SORT BY t.time, const, t.id
{code}
This produces NULL in the id column:
||time||const||id||
|1591819264|a|NULL|
|1591819265|a|NULL|
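The symptom in both examples is a projected column that is non-NULL in the input but NULL in every output row. A regression test could check for exactly that. The following is a minimal sketch in Python; the helper `lost_columns` is hypothetical and the rows are copied by hand from the second query and its reported output, not fetched from Hive.

```python
def lost_columns(source_rows, result_rows):
    """Return columns that are non-NULL in every source row
    but NULL (None) in every result row."""
    columns = source_rows[0].keys()
    return [
        c for c in columns
        if all(r[c] is not None for r in source_rows)
        and all(r[c] is None for r in result_rows)
    ]


# Input rows of the second example query (with the projected constant).
source = [
    {"time": 1591819264, "const": "a", "id": 1},
    {"time": 1591819265, "const": "a", "id": 2},
]

# Output as reported above.
reported = [
    {"time": 1591819264, "const": "a", "id": None},
    {"time": 1591819265, "const": "a", "id": None},
]

lost = lost_columns(source, reported)
print(lost)  # ['id']
```

Applied to a correct result, the same check returns an empty list.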
was:
Distribute by with sort by clause when used with constant as the first parameter for sort produces wrong result.
Example:
{code:sql}
SELECT
  t.time,
  'a' AS const
FROM
  (SELECT 1591819264 AS time
   UNION ALL
   SELECT 1591819265 AS time) t
DISTRIBUTE BY const
SORT BY const, t.time
{code}
Produces:
||time||const||
|NULL|a|
|NULL|a|

Instead it should produce:
||time||const||
|1591819264|a|
|1591819265|a|

Incorrect sort columns are used while creating ReduceSink: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L9066
Summary: Distribute by with sort by clause when used with constant parameter for sort produces wrong result. (was: Distribute by with sort by clause when used with constant as the first parameter for sort produces wrong result.)
> Distribute by with sort by clause when used with constant parameter for sort produces wrong result.
> ---------------------------------------------------------------------------------------------------
>
> Key: HIVE-24915
> URL: https://issues.apache.org/jira/browse/HIVE-24915
> Project: Hive
> Issue Type: Bug
> Affects Versions: 2.3.4
> Reporter: Suprith
> Assignee: Suprith
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h