Posted to dev@impala.apache.org by Zoltán Borók-Nagy <bo...@cloudera.com> on 2022/07/08 15:34:39 UTC

Impala reading V2 tables design doc

Hi Iceberg/Impala Team,

We've been working on adding read support for Iceberg V2 tables in Impala.
In the first round we're focusing on position deletes.

We are considering several approaches, so I've written a design doc
about them:
https://docs.google.com/document/d/1WF_UOanQ61RUuQlM4LaiRWI0YXpPKZ2VEJ8gyJdDyoY/

TL;DR:
The Scan Planning section <https://iceberg.apache.org/spec/#scan-planning>
of the Iceberg spec says:
A position delete file must be applied to a data file when all of the
following are true:

   - The data file’s sequence number is less than or equal to the delete
   file’s sequence number
   - ...

Basically, we would like to do an ANTI JOIN between data files and delete
files. We have some trouble with sequence numbers though, as these are not
exposed by the Iceberg API.
Does Iceberg allow deleting a data file, then adding a new one with the
same name? Probably not, as it would cause all kinds of problems, e.g. time
travel issues, and I can see that Iceberg generates unique file names. So
if the answer is no, then we probably don't even need the sequence number
during query execution.
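
To make the ANTI JOIN idea concrete, here is a minimal Python sketch. It is
an illustration only, not Impala code and not taken from the design doc. It
assumes file paths are unique, so the sequence number check is skipped
entirely. The Row and PositionDelete types and the anti_join helper are
hypothetical names; file_path and pos follow the column names the Iceberg
spec uses for position delete files.

```python
from typing import Iterable, List, NamedTuple


class Row(NamedTuple):
    """A data row tagged with its origin (hypothetical type for illustration)."""
    file_path: str  # data file the row was read from
    pos: int        # ordinal position of the row within that file
    value: str


class PositionDelete(NamedTuple):
    """One position delete record: (file_path, pos)."""
    file_path: str
    pos: int


def anti_join(rows: Iterable[Row], deletes: Iterable[PositionDelete]) -> List[Row]:
    """Keep only rows with no matching (file_path, pos) delete record.

    Assumption: file paths are never reused, so a path match alone is
    enough and sequence numbers are not consulted.
    """
    deleted = {(d.file_path, d.pos) for d in deletes}
    return [r for r in rows if (r.file_path, r.pos) not in deleted]


rows = [
    Row("a.parquet", 0, "x"),
    Row("a.parquet", 1, "y"),
    Row("b.parquet", 0, "z"),
]
deletes = [PositionDelete("a.parquet", 1)]

surviving = anti_join(rows, deletes)
```

If file names could be reused, the join key would also need the sequence
number comparison from the spec, which is exactly what we are hoping to
avoid.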

These and other interesting challenges and questions are in the doc. I hope
you enjoy reading it!

Cheers,
     Zoltan Borok-Nagy