You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Rajesh Balamohan (Jira)" <ji...@apache.org> on 2023/03/03 02:18:00 UTC
[jira] [Created] (HIVE-27119) Iceberg: Delete from table generates lot of files
Rajesh Balamohan created HIVE-27119:
---------------------------------------
Summary: Iceberg: Delete from table generates lot of files
Key: HIVE-27119
URL: https://issues.apache.org/jira/browse/HIVE-27119
Project: Hive
Issue Type: Improvement
Components: Iceberg integration
Reporter: Rajesh Balamohan
With "delete" it generates lot of files due to the way data is sent to the reducers. Files per partition is impacted by the number of reduce tasks.
One way could be to explicitly control the number of reducers; Creating this ticket to have a long term fix.
{noformat}
explain delete from store_Sales where ss_customer_sk % 10 = 0;
INFO : Compiling command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b): explain delete from store_Sales where ss_customer_sk % 10 = 0
INFO : No Stats for tpcds_1000_iceberg_mor_v4@store_sales, Columns: ss_sold_time_sk, ss_cdemo_sk, ss_promo_sk, ss_ext_discount_amt, ss_ext_sales_price, ss_net_profit, ss_addr_sk, ss_ticket_number, ss_wholesale_cost, ss_item_sk, ss_ext_list_price, ss_sold_date_sk, ss_store_sk, ss_coupon_amt, ss_quantity, ss_list_price, ss_sales_price, ss_customer_sk, ss_ext_wholesale_cost, ss_net_paid, ss_ext_tax, ss_hdemo_sk, ss_net_paid_inc_tax
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b); Time taken: 0.704 seconds
INFO : Executing command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b): explain delete from store_Sales where ss_customer_sk % 10 = 0
INFO : Starting task [Stage-4:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b); Time taken: 0.005 seconds
INFO : OK
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 depends on stages: Stage-2
Stage-3 depends on stages: Stage-0
STAGE PLANS:
Stage: Stage-1
Tez
DagId: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377
Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE)
DagName: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: store_sales
filterExpr: ((ss_customer_sk % 10) = 0) (type: boolean)
Statistics: Num rows: 2755519629 Data size: 3643899155232 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((ss_customer_sk % 10) = 0) (type: boolean)
Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: PARTITION__SPEC__ID (type: int), PARTITION__HASH (type: bigint), FILE__PATH (type: string), ROW__POSITION (type: bigint), ss_sold_time_sk (type: int), ss_item_sk (type: int), ss_customer_sk (type: int), ss_cdemo_sk (type: int), ss_hdemo_sk (type: int), ss_addr_sk (type: int), ss_store_sk (type: int), ss_promo_sk (type: int), ss_ticket_number (type: bigint), ss_quantity (type: int), ss_wholesale_cost (type: decimal(7,2)), ss_list_price (type: decimal(7,2)), ss_sales_price (type: decimal(7,2)), ss_ext_discount_amt (type: decimal(7,2)), ss_ext_sales_price (type: decimal(7,2)), ss_ext_wholesale_cost (type: decimal(7,2)), ss_ext_list_price (type: decimal(7,2)), ss_ext_tax (type: decimal(7,2)), ss_coupon_amt (type: decimal(7,2)), ss_net_paid (type: decimal(7,2)), ss_net_paid_inc_tax (type: decimal(7,2)), ss_net_profit (type: decimal(7,2)), ss_sold_date_sk (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26
Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int), _col1 (type: bigint), _col2 (type: string), _col3 (type: bigint)
null sort order: zzzz
sort order: ++++
Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE
value expressions: _col4 (type: int), _col5 (type: int), _col6 (type: int), _col7 (type: int), _col8 (type: int), _col9 (type: int), _col10 (type: int), _col11 (type: int), _col12 (type: bigint), _col13 (type: int), _col14 (type: decimal(7,2)), _col15 (type: decimal(7,2)), _col16 (type: decimal(7,2)), _col17 (type: decimal(7,2)), _col18 (type: decimal(7,2)), _col19 (type: decimal(7,2)), _col20 (type: decimal(7,2)), _col21 (type: decimal(7,2)), _col22 (type: decimal(7,2)), _col23 (type: decimal(7,2)), _col24 (type: decimal(7,2)), _col25 (type: decimal(7,2)), _col26 (type: int)
Execution mode: llap
LLAP IO: no inputs
Reducer 2
Execution mode: vectorized, llap
Reduce Operator Tree:
Select Operator
expressions: KEY.reducesinkkey0 (type: int), KEY.reducesinkkey1 (type: bigint), KEY.reducesinkkey2 (type: string), KEY.reducesinkkey3 (type: bigint), VALUE._col0 (type: int), VALUE._col1 (type: int), VALUE._col2 (type: int), VALUE._col3 (type: int), VALUE._col4 (type: int), VALUE._col5 (type: int), VALUE._col6 (type: int), VALUE._col7 (type: int), VALUE._col8 (type: bigint), VALUE._col9 (type: int), VALUE._col10 (type: decimal(7,2)), VALUE._col11 (type: decimal(7,2)), VALUE._col12 (type: decimal(7,2)), VALUE._col13 (type: decimal(7,2)), VALUE._col14 (type: decimal(7,2)), VALUE._col15 (type: decimal(7,2)), VALUE._col16 (type: decimal(7,2)), VALUE._col17 (type: decimal(7,2)), VALUE._col18 (type: decimal(7,2)), VALUE._col19 (type: decimal(7,2)), VALUE._col20 (type: decimal(7,2)), VALUE._col21 (type: decimal(7,2)), VALUE._col22 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26
Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.iceberg.mr.hive.HiveIcebergInputFormat
output format: org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
serde: org.apache.iceberg.mr.hive.HiveIcebergSerDe
name: tpcds_1000_iceberg_mor_v4.store_sales
Stage: Stage-2
Dependency Collection
Stage: Stage-0
Move Operator
tables:
replace: false
table:
input format: org.apache.iceberg.mr.hive.HiveIcebergInputFormat
output format: org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
serde: org.apache.iceberg.mr.hive.HiveIcebergSerDe
name: tpcds_1000_iceberg_mor_v4.store_sales
Stage: Stage-3
Stats Work
Basic Stats Work:
{noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)