You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Rajesh Balamohan (Jira)" <ji...@apache.org> on 2023/03/03 02:18:00 UTC

[jira] [Created] (HIVE-27119) Iceberg: Delete from table generates lot of files

Rajesh Balamohan created HIVE-27119:
---------------------------------------

             Summary: Iceberg: Delete from table generates lot of files
                 Key: HIVE-27119
                 URL: https://issues.apache.org/jira/browse/HIVE-27119
             Project: Hive
          Issue Type: Improvement
          Components: Iceberg integration
            Reporter: Rajesh Balamohan


With "delete" it generates lot of files due to the way data is sent to the reducers. Files per partition is impacted by the number of reduce tasks.

One way could be to explicitly control the number of reducers; Creating this ticket to have a long term fix.
 
{noformat}
 explain delete from store_Sales where ss_customer_sk % 10 = 0;
INFO  : Compiling command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b): explain delete from store_Sales where ss_customer_sk % 10 = 0
INFO  : No Stats for tpcds_1000_iceberg_mor_v4@store_sales, Columns: ss_sold_time_sk, ss_cdemo_sk, ss_promo_sk, ss_ext_discount_amt, ss_ext_sales_price, ss_net_profit, ss_addr_sk, ss_ticket_number, ss_wholesale_cost, ss_item_sk, ss_ext_list_price, ss_sold_date_sk, ss_store_sk, ss_coupon_amt, ss_quantity, ss_list_price, ss_sales_price, ss_customer_sk, ss_ext_wholesale_cost, ss_net_paid, ss_ext_tax, ss_hdemo_sk, ss_net_paid_inc_tax
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b); Time taken: 0.704 seconds
INFO  : Executing command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b): explain delete from store_Sales where ss_customer_sk % 10 = 0
INFO  : Starting task [Stage-4:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b); Time taken: 0.005 seconds
INFO  : OK
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
      DagName: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: ((ss_customer_sk % 10) = 0) (type: boolean)
                  Statistics: Num rows: 2755519629 Data size: 3643899155232 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((ss_customer_sk % 10) = 0) (type: boolean)
                    Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: PARTITION__SPEC__ID (type: int), PARTITION__HASH (type: bigint), FILE__PATH (type: string), ROW__POSITION (type: bigint), ss_sold_time_sk (type: int), ss_item_sk (type: int), ss_customer_sk (type: int), ss_cdemo_sk (type: int), ss_hdemo_sk (type: int), ss_addr_sk (type: int), ss_store_sk (type: int), ss_promo_sk (type: int), ss_ticket_number (type: bigint), ss_quantity (type: int), ss_wholesale_cost (type: decimal(7,2)), ss_list_price (type: decimal(7,2)), ss_sales_price (type: decimal(7,2)), ss_ext_discount_amt (type: decimal(7,2)), ss_ext_sales_price (type: decimal(7,2)), ss_ext_wholesale_cost (type: decimal(7,2)), ss_ext_list_price (type: decimal(7,2)), ss_ext_tax (type: decimal(7,2)), ss_coupon_amt (type: decimal(7,2)), ss_net_paid (type: decimal(7,2)), ss_net_paid_inc_tax (type: decimal(7,2)), ss_net_profit (type: decimal(7,2)), ss_sold_date_sk (type: int)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26
                      Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int), _col1 (type: bigint), _col2 (type: string), _col3 (type: bigint)
                        null sort order: zzzz
                        sort order: ++++
                        Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col4 (type: int), _col5 (type: int), _col6 (type: int), _col7 (type: int), _col8 (type: int), _col9 (type: int), _col10 (type: int), _col11 (type: int), _col12 (type: bigint), _col13 (type: int), _col14 (type: decimal(7,2)), _col15 (type: decimal(7,2)), _col16 (type: decimal(7,2)), _col17 (type: decimal(7,2)), _col18 (type: decimal(7,2)), _col19 (type: decimal(7,2)), _col20 (type: decimal(7,2)), _col21 (type: decimal(7,2)), _col22 (type: decimal(7,2)), _col23 (type: decimal(7,2)), _col24 (type: decimal(7,2)), _col25 (type: decimal(7,2)), _col26 (type: int)
            Execution mode: llap
            LLAP IO: no inputs
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Select Operator
                expressions: KEY.reducesinkkey0 (type: int), KEY.reducesinkkey1 (type: bigint), KEY.reducesinkkey2 (type: string), KEY.reducesinkkey3 (type: bigint), VALUE._col0 (type: int), VALUE._col1 (type: int), VALUE._col2 (type: int), VALUE._col3 (type: int), VALUE._col4 (type: int), VALUE._col5 (type: int), VALUE._col6 (type: int), VALUE._col7 (type: int), VALUE._col8 (type: bigint), VALUE._col9 (type: int), VALUE._col10 (type: decimal(7,2)), VALUE._col11 (type: decimal(7,2)), VALUE._col12 (type: decimal(7,2)), VALUE._col13 (type: decimal(7,2)), VALUE._col14 (type: decimal(7,2)), VALUE._col15 (type: decimal(7,2)), VALUE._col16 (type: decimal(7,2)), VALUE._col17 (type: decimal(7,2)), VALUE._col18 (type: decimal(7,2)), VALUE._col19 (type: decimal(7,2)), VALUE._col20 (type: decimal(7,2)), VALUE._col21 (type: decimal(7,2)), VALUE._col22 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26
                Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.iceberg.mr.hive.HiveIcebergInputFormat
                      output format: org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
                      serde: org.apache.iceberg.mr.hive.HiveIcebergSerDe
                      name: tpcds_1000_iceberg_mor_v4.store_sales

  Stage: Stage-2
    Dependency Collection

  Stage: Stage-0
    Move Operator
      tables:
          replace: false
          table:
              input format: org.apache.iceberg.mr.hive.HiveIcebergInputFormat
              output format: org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
              serde: org.apache.iceberg.mr.hive.HiveIcebergSerDe
              name: tpcds_1000_iceberg_mor_v4.store_sales

  Stage: Stage-3
    Stats Work
      Basic Stats Work:
 {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)