Posted to issues@hive.apache.org by "Ramesh Kumar Thangarajan (Jira)" <ji...@apache.org> on 2020/05/28 14:16:00 UTC

[jira] [Assigned] (HIVE-23541) Vectorization: Unbounded following window function start producing results too early

     [ https://issues.apache.org/jira/browse/HIVE-23541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ramesh Kumar Thangarajan reassigned HIVE-23541:
-----------------------------------------------

    Assignee: Ramesh Kumar Thangarajan

> Vectorization: Unbounded following window function start producing results too early
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-23541
>                 URL: https://issues.apache.org/jira/browse/HIVE-23541
>             Project: Hive
>          Issue Type: Bug
>          Components: PTF-Windowing, Vectorization
>    Affects Versions: 4.0.0, 3.1.2
>            Reporter: Gopal Vijayaraghavan
>            Assignee: Ramesh Kumar Thangarajan
>            Priority: Major
>
> ReduceRecordSource indicates the end of a group for a reducer input whenever the entire key changes.
> ReduceRecordSource::processVectorGroup calls reducer.setNextVectorBatchGroupStatus(/* isLastGroupBatch */ true); when the last group is being processed.
> However, for PTF window functions with unbounded following, this signal is triggered by a change in the full key, not by a change in the partition key.
> As a result, the VectorPTFOperator treats a change in the sort key as a switch of the partition key and starts producing results too early.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/ptf/VectorPTFOperator.java#L399
> {code}
> create temporary table test2(id STRING,name STRING,event_dt date) stored as orc;
> insert into test2 values ('100','A','2019-08-15'), ('100','A','2019-10-12');
> SELECT name, event_dt, first_value(event_dt) over (PARTITION BY name ORDER BY event_dt desc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) last_event_dt FROM test2; -- streaming FIRST_VALUE with DESCENDING
> SELECT name, event_dt, last_value(event_dt) over (PARTITION BY name ORDER BY event_dt asc ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_event_dt FROM test2; -- non-streaming LAST_VALUE with ASCENDING
> {code}
> These two queries should return identical results, with the streaming version being significantly faster than the non-streaming one, due to the lack of buffered/spilled rows with streaming.
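
The boundary problem described above can be illustrated with a small standalone sketch. This is not Hive code: the class WindowBoundarySketch, the Row type, and the flushOnFullKeyChange flag are hypothetical names made up for illustration, and the logic only assumes that rows arrive sorted by (name, event_dt) as in the repro. It computes last_value over an unbounded frame twice, once flushing whenever the full key changes (the reported behavior) and once flushing only when the partition key changes (the expected behavior).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Hive.
public class WindowBoundarySketch {

    /** One reducer input row: partition key (name) plus sort key (eventDt). */
    public static final class Row {
        final String name;
        final String eventDt;

        public Row(String name, String eventDt) {
            this.name = name;
            this.eventDt = eventDt;
        }
    }

    /**
     * Computes last_value(eventDt) over UNBOUNDED PRECEDING .. UNBOUNDED FOLLOWING.
     * Rows are assumed to be sorted by (name, eventDt).
     * If flushOnFullKeyChange is true, a change in either column ends the buffer,
     * which mimics treating a sort-key change as a partition switch.
     */
    public static List<String> lastValue(List<Row> rows, boolean flushOnFullKeyChange) {
        List<String> out = new ArrayList<>();
        List<Row> buffered = new ArrayList<>();
        for (Row row : rows) {
            if (!buffered.isEmpty()) {
                Row prev = buffered.get(buffered.size() - 1);
                boolean partitionChanged = !prev.name.equals(row.name);
                boolean sortKeyChanged = !prev.eventDt.equals(row.eventDt);
                boolean boundary = flushOnFullKeyChange
                        ? (partitionChanged || sortKeyChanged)
                        : partitionChanged;
                if (boundary) {
                    flush(buffered, out);
                }
            }
            buffered.add(row);
        }
        flush(buffered, out);
        return out;
    }

    /** Emits the last buffered eventDt once per buffered row, then clears the buffer. */
    private static void flush(List<Row> buffered, List<String> out) {
        if (buffered.isEmpty()) {
            return;
        }
        String last = buffered.get(buffered.size() - 1).eventDt;
        for (int i = 0; i < buffered.size(); i++) {
            out.add(last);
        }
        buffered.clear();
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("A", "2019-08-15"),
                new Row("A", "2019-10-12"));
        System.out.println(lastValue(rows, true));   // flush on full key change
        System.out.println(lastValue(rows, false));  // flush on partition change only
    }
}
```

With the two rows from the repro, flushing on a full key change emits each row's own date, while flushing only on a partition change emits 2019-10-12 for both rows, which is what the non-streaming LAST_VALUE query is expected to return.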



--
This message was sent by Atlassian Jira
(v8.3.4#803005)