Posted to dev@iceberg.apache.org by Chen Song <ch...@gmail.com> on 2021/04/09 00:15:05 UTC

question on range overwrite/delete within a partition

Say I have a table that contains the following data rows.

date, content
20210301, "a1"
20210302, "a2"
20210303, "a3"
...
20210401, "b1"
20210402, "b2"
20210403, "b3"
20210404, "b4"
20210405, "b5"

The table is partitioned by month(date), and on write the data is stored
in partitioned data files in sorted order.

Suppose I want to delete the rows in the date range [20210402, 20210404]
from partition 202104, as shown below, *assuming I can only use the
Iceberg core API*:

date, content
20210301, "a1"
20210302, "a2"
20210303, "a3"
...
20210401, "b1"
20210402, "b2"   <- to be deleted
20210403, "b3"   <- to be deleted
20210404, "b4"   <- to be deleted
20210405, "b5"
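
To make the setup concrete, here is a small sketch in plain Python (not the Iceberg API) that groups the sample rows by a yyyymm partition key and counts how many rows of each partition fall in the range [20210402, 20210404]. Treating month(date) as integer division by 100 is an assumption that only fits these integer-encoded dates, and the function names are made up for illustration.

```python
from collections import defaultdict

# Sample rows from the table above (the elided "..." rows are left out).
ROWS = [
    (20210301, "a1"), (20210302, "a2"), (20210303, "a3"),
    (20210401, "b1"), (20210402, "b2"), (20210403, "b3"),
    (20210404, "b4"), (20210405, "b5"),
]


def month_partition(date):
    """Assumed stand-in for month(date): 20210402 -> 202104."""
    return date // 100


def overlap_by_partition(rows, lo, hi):
    """Return {partition: (rows_in_range, total_rows)} for the closed range [lo, hi]."""
    counts = defaultdict(lambda: [0, 0])
    for date, _content in rows:
        entry = counts[month_partition(date)]
        entry[1] += 1              # every row counts toward the partition total
        if lo <= date <= hi:
            entry[0] += 1          # row falls inside the delete range
    return {partition: tuple(c) for partition, c in counts.items()}


overlap = overlap_by_partition(ROWS, 20210402, 20210404)
print(overlap)
```

With the sample rows, partition 202104 has 3 of its 5 rows in the range, so dropping the whole partition would also remove "b1" and "b5".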

I can think of the following options.

1. I know I can rewrite the entire partition by reading the data and
removing the range of rows. That will create new data files and delete the
old data files.
2. I looked into position delete files
<https://iceberg.apache.org/spec/#position-delete-files> and equality
delete files <https://iceberg.apache.org/spec/#equality-delete-files> in
the V2 spec to see if I can use row-level delete files to mark the rows to
be deleted. Equality deletes won't work here because I need to match a
range (or some other predicate), not a single value. Position deletes
don't seem to work either, because I would not know beforehand the exact
positions of the rows to be deleted within the data file (I only know the
key range). I know I can read the data file and then figure out the
positions, but that is effectively the same as re-reading the data.
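
The position-delete concern in option 2 can be sketched in plain Python (again, not the Iceberg API). The sketch assumes a single data file sorted by date, models a position delete as a (file_path, pos) pair with 0-based row positions, and uses a hypothetical file path. Even with sorted data, the date column has to be read back before the range can be turned into positions.

```python
from bisect import bisect_left, bisect_right


def position_deletes(file_path, sorted_dates, lo, hi):
    """Return (file_path, pos) pairs for rows whose date lies in [lo, hi].

    sorted_dates stands for the date column read back from one data file,
    in file order; pos is the 0-based ordinal of the row in that file.
    """
    start = bisect_left(sorted_dates, lo)    # first row with date >= lo
    end = bisect_right(sorted_dates, hi)     # one past the last row with date <= hi
    return [(file_path, pos) for pos in range(start, end)]


# Hypothetical data file for partition 202104, already sorted by date.
dates_in_file = [20210401, 20210402, 20210403, 20210404, 20210405]
deletes = position_deletes(
    "data/202104/file-0.parquet", dates_in_file, 20210402, 20210404
)
print(deletes)
```

For the sample partition this yields positions 1, 2 and 3, but only because the keys in dates_in_file were available to search, which is the read the question is trying to avoid.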

My question is: when using the Iceberg core API, is there a way to compose
a range delete like the one above without overwriting the entire partition
or reading back the data? Any thoughts?

-- 
Chen Song