Posted to dev@hive.apache.org by "Gopal Vijayaraghavan (Jira)" <ji...@apache.org> on 2020/05/23 06:42:00 UTC

[jira] [Created] (HIVE-23541) Vectorization: Unbounded following window function start producing results too early

Gopal Vijayaraghavan created HIVE-23541:
-------------------------------------------

             Summary: Vectorization: Unbounded following window function start producing results too early
                 Key: HIVE-23541
                 URL: https://issues.apache.org/jira/browse/HIVE-23541
             Project: Hive
          Issue Type: Bug
            Reporter: Gopal Vijayaraghavan


ReduceRecordSource signals the end of a group for a reducer input whenever the entire key changes.

ReduceRecordSource::processVectorGroup calls reducer.setNextVectorBatchGroupStatus(/* isLastGroupBatch */ true); when the last group is being processed.

However, for PTF window functions with unbounded following, this is triggered by a change in the full key, not by a change in the partition.

As a result, VectorPTFOperator treats a change in the sort key as a switch of the partition key and starts producing results too early.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/ptf/VectorPTFOperator.java#L399
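
The mismatch described above can be illustrated with a small, self-contained model. This is only a sketch: GroupBoundarySketch, Row, lastValue and the boundaryOnFullKey flag are hypothetical names invented for illustration, not Hive's classes or the actual VectorPTFOperator logic. It assumes rows arrive sorted by (partition key, sort key) and that buffered rows are flushed whenever a group boundary is signalled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the behaviour described in this report.
// None of these names exist in Hive.
class GroupBoundarySketch {

    // One reducer input row: partition key (name) and sort key (eventDt).
    static final class Row {
        final String name;
        final String eventDt;

        Row(String name, String eventDt) {
            this.name = name;
            this.eventDt = eventDt;
        }
    }

    // Decides whether a group ends between two consecutive rows.
    // onFullKey = true  models "the entire key changed" (the reported behaviour).
    // onFullKey = false models "the partition key changed" (the expected behaviour).
    static boolean boundary(Row cur, Row next, boolean onFullKey) {
        boolean partitionChanged = !cur.name.equals(next.name);
        boolean sortKeyChanged = !cur.eventDt.equals(next.eventDt);
        return onFullKey ? (partitionChanged || sortKeyChanged) : partitionChanged;
    }

    // Emulates LAST_VALUE(event_dt) with an UNBOUNDED FOLLOWING frame:
    // rows are buffered and only flushed when a group boundary is signalled.
    static List<String> lastValue(List<Row> sortedRows, boolean boundaryOnFullKey) {
        List<String> out = new ArrayList<>();
        List<Row> buffer = new ArrayList<>();
        for (int i = 0; i < sortedRows.size(); i++) {
            Row row = sortedRows.get(i);
            buffer.add(row);
            boolean endOfInput = (i == sortedRows.size() - 1);
            if (endOfInput || boundary(row, sortedRows.get(i + 1), boundaryOnFullKey)) {
                // The last buffered row supplies the value for the whole flushed group.
                String last = buffer.get(buffer.size() - 1).eventDt;
                for (Row r : buffer) {
                    out.add(r.name + "," + r.eventDt + "," + last);
                }
                buffer.clear();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("A", "2019-08-15"));
        rows.add(new Row("A", "2019-10-12"));
        System.out.println("boundary on partition key: " + lastValue(rows, false));
        System.out.println("boundary on full key:      " + lastValue(rows, true));
    }
}
```

In this toy model, with the two rows from the reproduction below, signalling on the partition key yields 2019-10-12 for both rows, while signalling on the full key flushes the first row on its own with 2019-08-15. The exact incorrect output of the real operator may differ; the sketch only shows why an early end-of-group signal is a problem for an unbounded following frame.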

{code}
create temporary table test2(id STRING,name STRING,event_dt date) stored as orc;

insert into test2 values ('100','A','2019-08-15'), ('100','A','2019-10-12');


SELECT name, event_dt, first_value(event_dt) over (PARTITION BY name ORDER BY event_dt desc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) last_event_dt FROM test2; -- streaming FIRST_VALUE with DESCENDING

SELECT name, event_dt, last_value(event_dt) over (PARTITION BY name ORDER BY event_dt asc ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_event_dt FROM test2; -- non-streaming LAST_VALUE with ASCENDING
{code}

These two queries should return identical results, and the streaming version should be significantly faster than the non-streaming one because it does not need to buffer or spill rows.
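
Why the two window definitions are expected to agree can be checked with a plain in-memory reference computation over a single partition. This is a sketch under stated assumptions: WindowEquivalenceSketch and its two methods are hypothetical names, dates are ISO-formatted strings (so string order equals date order), and the results are sorted before comparison because the two queries order their rows differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical reference model of the two window definitions; not Hive code.
class WindowEquivalenceSketch {

    // FIRST_VALUE over ORDER BY event_dt DESC with a frame ending at the current row.
    // Streaming: the answer is known as soon as the first row of the partition is seen.
    static List<String> streamingFirstValueDesc(List<String> partitionDates) {
        List<String> sorted = new ArrayList<>(partitionDates);
        sorted.sort(Comparator.reverseOrder());
        List<String> out = new ArrayList<>();
        String first = null;
        for (String dt : sorted) {
            if (first == null) {
                first = dt; // remembered once, no row buffering needed
            }
            out.add(dt + "," + first);
        }
        Collections.sort(out); // normalise row order for comparison
        return out;
    }

    // LAST_VALUE over ORDER BY event_dt ASC with a frame covering the whole partition.
    // Non-streaming: every row is buffered until the partition ends.
    static List<String> bufferedLastValueAsc(List<String> partitionDates) {
        List<String> buffer = new ArrayList<>(partitionDates);
        Collections.sort(buffer);
        List<String> out = new ArrayList<>();
        if (buffer.isEmpty()) {
            return out;
        }
        String last = buffer.get(buffer.size() - 1);
        for (String dt : buffer) {
            out.add(dt + "," + last);
        }
        Collections.sort(out); // normalise row order for comparison
        return out;
    }

    public static void main(String[] args) {
        List<String> dates = new ArrayList<>();
        dates.add("2019-08-15");
        dates.add("2019-10-12");
        System.out.println(streamingFirstValueDesc(dates));
        System.out.println(bufferedLastValueAsc(dates));
    }
}
```

For the partition name = 'A' in the example data, both methods pair every row with 2019-10-12, which is the result both queries are expected to return.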


