Posted to issues@hive.apache.org by "Suprith (Jira)" <ji...@apache.org> on 2021/03/19 20:56:00 UTC

[jira] [Updated] (HIVE-24915) Distribute by with sort by clause when used with constant parameter for sort produces wrong result.

     [ https://issues.apache.org/jira/browse/HIVE-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suprith updated HIVE-24915:
---------------------------
    Description: 
A query that combines DISTRIBUTE BY with SORT BY produces wrong results when a constant column is one of the SORT BY keys.

Example: 
{code:sql}
SELECT
  t.time,
  'a' AS const
FROM
  (SELECT 1591819264 AS time
   UNION ALL
   SELECT 1591819265 AS time) t
DISTRIBUTE BY const
SORT BY const, t.time
{code}
Produces:

||time||const||
|NULL|a|
|NULL|a|

Instead, it should produce the following (Hive 0.13 does):

||time||const||
|1591819264|a|
|1591819265|a|

The wrong sort columns are used when the ReduceSink operator is created here: [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L9066]

When the constant propagation optimizer is enabled, these wrong sort columns lead to incorrect constant folding of the operator, and the query returns incorrect results.
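
One possible way to look at both points on an affected build is sketched below; it is not taken from the patch. EXPLAIN prints the query plan, in which the Reduce Output Operator lists its key expressions and sort order, and hive.optimize.constant.propagation switches the constant propagation optimizer on or off. Whether disabling the optimizer changes the result of this query is an assumption that has not been verified here.

{code:sql}
-- Show the plan; check which columns the Reduce Output Operator sorts on.
EXPLAIN
SELECT
  t.time,
  'a' AS const
FROM
  (SELECT 1591819264 AS time
   UNION ALL
   SELECT 1591819265 AS time) t
DISTRIBUTE BY const
SORT BY const, t.time;

-- Assumption: rerunning with the optimizer disabled helps isolate the folding step.
SET hive.optimize.constant.propagation=false;

SELECT
  t.time,
  'a' AS const
FROM
  (SELECT 1591819264 AS time
   UNION ALL
   SELECT 1591819265 AS time) t
DISTRIBUTE BY const
SORT BY const, t.time;
{code}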

 

Another example of incorrect behavior, with the constant in the middle of the SORT BY keys:

{code:sql}
SELECT
  t.time,
  'a' AS const,
  t.id
FROM
  (SELECT 1591819264 AS time, 1 AS id
   UNION ALL
   SELECT 1591819265 AS time, 2 AS id) t
DISTRIBUTE BY t.time
SORT BY t.time, const, t.id
{code}
Produces (id is NULL; the expected values are 1 and 2):

||time||const||id||
|1591819264|a|NULL|
|1591819265|a|NULL|
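
Sorting by a constant does not affect row order, so the constant can be left out of the SORT BY keys without changing the intended result. The rewrite below is a sketch of a possible workaround; that it avoids the problem on affected versions is an assumption and has not been verified.

{code:sql}
-- Same query as above, with the constant removed from SORT BY.
-- The constant is still selected; it is only dropped as a sort key.
SELECT
  t.time,
  'a' AS const,
  t.id
FROM
  (SELECT 1591819264 AS time, 1 AS id
   UNION ALL
   SELECT 1591819265 AS time, 2 AS id) t
DISTRIBUTE BY t.time
SORT BY t.time, t.id;
{code}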

 

  was:
Distribute by with sort by clause when used with constant as the first parameter for sort produces wrong result.

Example: 

{code:sql}
SELECT
  t.time,
  'a' AS const
FROM
  (SELECT 1591819264 AS time
   UNION ALL
   SELECT 1591819265 AS time) t
DISTRIBUTE BY const
SORT BY const, t.time
{code}

Produces

||time||const||
|NULL|a|
|NULL|a|

Instead it should produce:

||time||const||
|1591819264|a|
|1591819265|a|

Incorrect sort columns are used while creating ReduceSink https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L9066

        Summary: Distribute by with sort by clause when used with constant parameter for sort produces wrong result.  (was: Distribute by with sort by clause when used with constant as the first parameter for sort produces wrong result.)

> Distribute by with sort by clause when used with constant parameter for sort produces wrong result.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-24915
>                 URL: https://issues.apache.org/jira/browse/HIVE-24915
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.3.4
>            Reporter: Suprith
>            Assignee: Suprith
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.3.4#803005)