Posted to dev@iceberg.apache.org by OpenInx <op...@gmail.com> on 2021/03/15 03:29:39 UTC

Sync: the progress of row-level delete

Hi iceberg dev:

Junjie Chen and I have made some progress on the Rewrite Action for
format v2.  There will be two kinds of Rewrite Action:

1.  The first one rewrites equality delete rows into position delete
rows.  The PoC PR is here: https://github.com/apache/iceberg/pull/2216
2.  The second one removes all deletes during the rewrite.  The PR is:
https://github.com/apache/iceberg/pull/2303

The reason Junjie and I raised the priority of RewriteAction is that some
companies in Asia are running PoCs that write CDC/Upsert events into
iceberg tables and then read them with batch flink/spark/presto jobs.  The
biggest bottleneck is small delete/data files: because the streaming job
checkpoints periodically, it produces many small data, equality delete and
position delete files in the underlying filesystem, which hurts read
performance.

As for the implementation of RewriteAction, I think we are confident that
we can accomplish it.  The key problem is: how do we handle conflicts
between a RewriteFiles txn and a RowDelta txn?  I filed an issue here:
https://github.com/apache/iceberg/issues/2308

In my opinion, the RewriteFiles action never changes the data set of the
iceberg table: it does not add, remove or change a single row.  So from a
database developer's perspective, it should not conflict with normal write
actions (RowDelta), because there's no key/row overlap between the two
actions.  In the iceberg implementation, however, we have to handle
conflicts, because both the RewriteAction and RowDelta txns share the same
increasing sequence number.

Let's discuss the case from ISSUE#2308:

The original table contains the following data set, committed with
sequence number 1:

Seq1:  (RowDelta 1)
INSERT,  <1, A>
INSERT,  <2, B>
DELETE, <1, A>

If the RewriteAction commits before the following RowDelta, we get the
following operations and sequence numbers (reading the latest snapshot
finally returns the empty set):

Seq2:  (Rewrite)
INSERT, <2, B>

Seq3:  (RowDelta 2)
DELETE, <2, B>

If instead the RowDelta commits before the RewriteAction, we get the
following operations and sequence numbers (reading the latest snapshot
finally returns <2, B>):

Seq2: (RowDelta 2)
DELETE, <2, B>

Seq3: (Rewrite)
INSERT, <2,B>


Summary:  As we can see, different commit orders produce different data
sets in the iceberg table, which is not the semantics a user would expect.
So I'm considering letting the RewriteFilesAction commit its txn without
producing a new auto-increasing sequence number (the RewriteAction would
instead use the largest sequence number among the existing files), so that
the results are always consistent regardless of the commit order.  Since
this change touches the iceberg table format/spec, I'd like to hear your
opinions.  What do you think?
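
To make the two commit orders concrete, here is a small Python sketch of
the case above.  It is only a toy model, not iceberg code: DataFile,
EqDelete, scan and simulate are hypothetical names, RowDelta 2 is assumed
to be written as an equality delete, and the state after Seq1 is reduced
to the single surviving row <2, B>.  The one rule taken from the v2 format
is that an equality delete only applies to data files with a strictly
smaller sequence number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    seq: int          # sequence number assigned to the data file
    rows: frozenset   # rows stored in the file


@dataclass(frozen=True)
class EqDelete:
    seq: int          # sequence number assigned to the delete
    row: tuple        # row removed by equality


def scan(data_files, eq_deletes):
    """Return live rows: an equality delete hides a row only in data
    files whose sequence number is strictly smaller than the delete's."""
    live = set()
    for f in data_files:
        for row in f.rows:
            if not any(d.row == row and d.seq > f.seq for d in eq_deletes):
                live.add(row)
    return live


def simulate(commit_order, reuse_seq):
    """Replay 'rewrite' and 'row_delta' commits in the given order.

    reuse_seq=False: the rewrite takes a fresh sequence number.
    reuse_seq=True:  the rewrite keeps the largest sequence number of the
                     files it replaces (the proposal; assumed behaviour).
    """
    # State after Seq1: <1, A> was inserted and deleted, <2, B> is live.
    original = DataFile(seq=1, rows=frozenset({(2, "B")}))
    data_files = [original]
    eq_deletes = []
    next_seq = 2
    for op in commit_order:
        if op == "rewrite":
            if reuse_seq:
                seq = original.seq      # no new sequence number consumed
            else:
                seq = next_seq
                next_seq += 1
            data_files = [DataFile(seq=seq, rows=original.rows)]
        elif op == "row_delta":
            eq_deletes.append(EqDelete(seq=next_seq, row=(2, "B")))
            next_seq += 1
    return scan(data_files, eq_deletes)


if __name__ == "__main__":
    for reuse in (False, True):
        for order in (["rewrite", "row_delta"], ["row_delta", "rewrite"]):
            print(reuse, order, simulate(order, reuse))
```

With a fresh sequence number for the rewrite, the two orders disagree
(empty set vs. <2, B>); when the rewrite keeps the old sequence number,
both orders return the empty set in this model.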

Thanks.