You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "gaozhan ding (Jira)" <ji...@apache.org> on 2021/05/21 10:30:00 UTC
[jira] [Created] (HIVE-25147) Limit offset query in CTAS will cause
data loss
gaozhan ding created HIVE-25147:
-----------------------------------
Summary: Limit offset query in CTAS will cause data loss
Key: HIVE-25147
URL: https://issues.apache.org/jira/browse/HIVE-25147
Project: Hive
Issue Type: Bug
Components: Parser
Affects Versions: 3.1.0
Reporter: gaozhan ding
query like:
{code:java}
create table ... as select ... from ... limit offset
or:
insert overwrite table ... select ... from ... limit offset
{code}
will cause data loss.
reproduce step:
{code:java}
create table test_limit_offset (id int);
insert into test_limit_offset values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16);
drop table if exists test_limit_offset2;
create table test_limit_offset2 as select * from test_limit_offset limit 5 offset 2;
{code}
query test_limit_offset2
{code:java}
+------------------------+
| test_limit_offset2.id |
+------------------------+
| 5 |
| 6 |
| 7 |
+------------------------+
{code}
expected 5 numbers while got 3;
We can see the problem from the execution plan
{code:java}
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| Plan optimized by CBO. |
| |
| Vertex dependency in root stage |
| Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE) |
| |
| Stage-3 |
| Stats Work{} |
| Stage-9 |
| Create Table Operator: |
| name:dgz.test_limit_offset2 |
| Stage-2 |
| Dependency Collection{} |
| Stage-5(CONDITIONAL) |
| Move Operator |
| Stage-8(CONDITIONAL CHILD TASKS: Stage-5, Stage-4, Stage-6) |
| Conditional Operator |
| Stage-1 |
| Reducer 2 |
| File Output Operator [FS_6] |
| table:{"name:":"dgz.test_limit_offset2"} |
| Limit [LIM_5] (rows=5 width=1) | //reduce side full limit offset
| Number of rows:5,Offset of rows:2 |
| Select Operator [SEL_4] (rows=5 width=1) |
| Output:["_col0"] |
| <-Map 1 [CUSTOM_SIMPLE_EDGE] |
| PARTITION_ONLY_SHUFFLE [RS_3] |
| Limit [LIM_2] (rows=5 width=1) | //map side full limit offset
| Number of rows:5,Offset of rows:2 |
| Select Operator [SEL_1] (rows=13 width=1) |
| Output:["_col0"] |
| TableScan [TS_0] (rows=13 width=1) |
| dgz@test_limit_offset,test_limit_offset,Tbl:COMPLETE,Col:NONE,Output:["id"] |
| Stage-4(CONDITIONAL) |
| File Merge |
| Please refer to the previous Stage-8(CONDITIONAL CHILD TASKS: Stage-5, Stage-4, Stage-6) |
| Stage-7 |
| Move Operator |
| Stage-6(CONDITIONAL) |
| File Merge |
| Please refer to the previous Stage-8(CONDITIONAL CHILD TASKS: Stage-5, Stage-4, Stage-6) |
| Stage-0 |
| Move Operator |
| Please refer to the previous Stage-5(CONDITIONAL) |
| Please refer to the previous Stage-4(CONDITIONAL) |
| Please refer to the previous Stage-7 |
| |
+----------------------------------------------------+
{code}
It generate limit operator on both map and reduce side.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)