You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by "Rajesh Balamohan (Jira)" <ji...@apache.org> on 2022/09/16 09:24:00 UTC

[jira] [Created] (HIVE-26540) Iceberg: Select queries after update/delete become expensive in reading contents

Rajesh Balamohan created HIVE-26540:
---------------------------------------

             Summary: Iceberg: Select queries after update/delete become  expensive in reading contents
                 Key: HIVE-26540
                 URL: https://issues.apache.org/jira/browse/HIVE-26540
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan


- Create basic date_dim table in tpcds. Store it in iceberg v2 format
- Update few 1000 records couple of times
- Run a simple select query {{select count ( * ) from date_dim_ice where d_qoy = 11 and d_dom=2 and d_fy_week_seq=3;}}

This takes 8-18 seconds where ACID takes 1.5 seconds.

Basic issue is that, it reads files multiple times (i.e both data and delete files).

Lines of interest:

IcebergInputFormat.java

{noformat}
   InternalRecordWrapper wrapper = new InternalRecordWrapper(readSchema.asStruct());
        Evaluator filter = new Evaluator(readSchema.asStruct(), residual, caseSensitive);
        return CloseableIterable.filter(iter, record -> filter.eval(wrapper.wrap((StructLike) record)));
{noformat}



{noformat}
   case GENERIC:
          DeleteFilter deletes = new GenericDeleteFilter(table.io(), currentTask, table.schema(), readSchema);
          Schema requiredSchema = deletes.requiredSchema();
          return deletes.filter(openGeneric(currentTask, requiredSchema));
{noformat}

These get evaluated for each row in the data file, causing delay.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)