Posted to dev@hive.apache.org by "Rajesh Balamohan (Jira)" <ji...@apache.org> on 2022/09/16 09:24:00 UTC
[jira] [Created] (HIVE-26540) Iceberg: Select queries after update/delete become expensive in reading contents
Rajesh Balamohan created HIVE-26540:
---------------------------------------
Summary: Iceberg: Select queries after update/delete become expensive in reading contents
Key: HIVE-26540
URL: https://issues.apache.org/jira/browse/HIVE-26540
Project: Hive
Issue Type: Improvement
Reporter: Rajesh Balamohan
- Create the basic date_dim table from TPC-DS and store it in Iceberg v2 format.
- Update a few thousand records a couple of times.
- Run a simple select query: {{select count ( * ) from date_dim_ice where d_qoy = 11 and d_dom=2 and d_fy_week_seq=3;}}
This takes 8-18 seconds, whereas ACID takes 1.5 seconds.
The basic issue is that it reads files multiple times (i.e., both data and delete files).
Lines of interest:
IcebergInputFormat.java
{noformat}
InternalRecordWrapper wrapper = new InternalRecordWrapper(readSchema.asStruct());
Evaluator filter = new Evaluator(readSchema.asStruct(), residual, caseSensitive);
return CloseableIterable.filter(iter, record -> filter.eval(wrapper.wrap((StructLike) record)));
{noformat}
{noformat}
case GENERIC:
DeleteFilter deletes = new GenericDeleteFilter(table.io(), currentTask, table.schema(), readSchema);
Schema requiredSchema = deletes.requiredSchema();
return deletes.filter(openGeneric(currentTask, requiredSchema));
{noformat}
These get evaluated for each row in the data file, which causes the delay.
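To picture why per-row work like this adds up, here is a toy model in plain Java. It is only a sketch under stated assumptions: it does not use or reproduce the Iceberg code path, and every name in it (DeleteReadSketch, loadDeletedPositions, scanReloadingPerRow, scanLoadingOnce) is hypothetical. It contrasts a scan that re-reads a stand-in "delete file" for every row with one that reads it once per scan, and counts the reads.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DeleteReadSketch {
    // Counts how many times the hypothetical delete file is "read".
    static int deleteFileLoads = 0;

    // Hypothetical stand-in for reading a positional delete file.
    // This is NOT an Iceberg API; it only models the cost of a read.
    static Set<Long> loadDeletedPositions(long[] deletedPositions) {
        deleteFileLoads++;
        Set<Long> positions = new HashSet<>();
        for (long p : deletedPositions) {
            positions.add(p);
        }
        return positions;
    }

    // Assumed worst case: the delete set is rebuilt for every row.
    static List<Long> scanReloadingPerRow(long rowCount, long[] deleted) {
        List<Long> live = new ArrayList<>();
        for (long pos = 0; pos < rowCount; pos++) {
            if (!loadDeletedPositions(deleted).contains(pos)) {
                live.add(pos);
            }
        }
        return live;
    }

    // Alternative: read the delete set once, then reuse it for all rows.
    static List<Long> scanLoadingOnce(long rowCount, long[] deleted) {
        Set<Long> deletedSet = loadDeletedPositions(deleted);
        List<Long> live = new ArrayList<>();
        for (long pos = 0; pos < rowCount; pos++) {
            if (!deletedSet.contains(pos)) {
                live.add(pos);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        long[] deleted = {1, 3};

        deleteFileLoads = 0;
        List<Long> perRow = scanReloadingPerRow(5, deleted);
        System.out.println("per-row reload: rows=" + perRow + " loads=" + deleteFileLoads);

        deleteFileLoads = 0;
        List<Long> once = scanLoadingOnce(5, deleted);
        System.out.println("load once:      rows=" + once + " loads=" + deleteFileLoads);
    }
}
```

Both scans return the same surviving rows; only the number of reads differs (one per row versus one per scan). Whether the actual overhead in IcebergInputFormat comes from repeated reads, from the per-row residual evaluation, or from both would need profiling to confirm.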