Posted to issues@hive.apache.org by "Ramesh Kumar Thangarajan (Jira)" <ji...@apache.org> on 2020/05/28 14:16:00 UTC
[jira] [Assigned] (HIVE-23541) Vectorization: Unbounded following
window function start producing results too early
[ https://issues.apache.org/jira/browse/HIVE-23541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ramesh Kumar Thangarajan reassigned HIVE-23541:
-----------------------------------------------
Assignee: Ramesh Kumar Thangarajan
> Vectorization: Unbounded following window function start producing results too early
> ------------------------------------------------------------------------------------
>
> Key: HIVE-23541
> URL: https://issues.apache.org/jira/browse/HIVE-23541
> Project: Hive
> Issue Type: Bug
> Components: PTF-Windowing, Vectorization
> Affects Versions: 4.0.0, 3.1.2
> Reporter: Gopal Vijayaraghavan
> Assignee: Ramesh Kumar Thangarajan
> Priority: Major
>
> ReduceRecordSource signals the end of a group of reducer input whenever the entire key changes.
> ReduceRecordSource::processVectorGroup calls reducer.setNextVectorBatchGroupStatus(/* isLastGroupBatch */ true); when the last batch of a group is being processed.
> However, for PTF window functions with unbounded following, this is triggered by a change in the full key and not by a change in the partition key.
> As a result, the VectorPTFOperator treats a change in the sort key as a change of the partition key and starts producing results too early.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/ptf/VectorPTFOperator.java#L399
> {code}
> create temporary table test2(id STRING,name STRING,event_dt date) stored as orc;
> insert into test2 values ('100','A','2019-08-15'), ('100','A','2019-10-12');
> SELECT name, event_dt, first_value(event_dt) over (PARTITION BY name ORDER BY event_dt desc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) last_event_dt FROM test2; -- streaming FIRST_VALUE with DESCENDING
> SELECT name, event_dt, last_value(event_dt) over (PARTITION BY name ORDER BY event_dt asc ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_event_dt FROM test2; -- non-streaming LAST_VALUE with ASCENDING
> {code}
> These two queries should return identical results. The streaming version should be significantly faster than the non-streaming one, because streaming does not need to buffer or spill rows.
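
The difference between a change in the full key and a change in the partition key, as described in the report above, can be illustrated with a small standalone sketch. It is not Hive code: the class GroupBoundarySketch, the Row type, and all method names are hypothetical, and the sketch only mimics evaluating last_value over UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING on the two sample rows. It assumes the rows arrive sorted by (name, event_dt), as reducer input would be.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

public class GroupBoundarySketch {

    // Hypothetical stand-in for a reducer input row keyed by (partition column, order column).
    static final class Row {
        final String name;     // PARTITION BY column
        final String eventDt;  // ORDER BY column

        Row(String name, String eventDt) {
            this.name = name;
            this.eventDt = eventDt;
        }
    }

    // Splits already-sorted rows into groups; a new group starts whenever sameGroup is false.
    static List<List<Row>> split(List<Row> sorted, BiPredicate<Row, Row> sameGroup) {
        List<List<Row>> groups = new ArrayList<>();
        Row prev = null;
        for (Row row : sorted) {
            if (prev == null || !sameGroup.test(prev, row)) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(row);
            prev = row;
        }
        return groups;
    }

    // Emits, for every row, the last eventDt of the group the row belongs to.
    static List<String> lastValue(List<List<Row>> groups) {
        List<String> out = new ArrayList<>();
        for (List<Row> group : groups) {
            String last = group.get(group.size() - 1).eventDt;
            for (int i = 0; i < group.size(); i++) {
                out.add(last);
            }
        }
        return out;
    }

    // The two rows from the report's test2 table, sorted by (name, event_dt asc).
    static List<Row> sampleRows() {
        return Arrays.asList(new Row("A", "2019-08-15"), new Row("A", "2019-10-12"));
    }

    // Group boundary on any change of the full key (partition + order columns).
    static List<String> lastValueByFullKey() {
        return lastValue(split(sampleRows(),
                (a, b) -> a.name.equals(b.name) && a.eventDt.equals(b.eventDt)));
    }

    // Group boundary only when the partition column changes.
    static List<String> lastValueByPartitionKey() {
        return lastValue(split(sampleRows(), (a, b) -> a.name.equals(b.name)));
    }

    public static void main(String[] args) {
        System.out.println("boundary on full key:      " + lastValueByFullKey());
        System.out.println("boundary on partition key: " + lastValueByPartitionKey());
    }
}
```

In this sketch, splitting on the full key closes the group after the first row, so that row is emitted before the later date has been seen. Splitting on the partition key alone keeps both rows in one group, so both rows receive the later date.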
--
This message was sent by Atlassian Jira
(v8.3.4#803005)