Posted to dev@iceberg.apache.org by Vivekanand Vellanki <vi...@dremio.com> on 2021/01/19 15:47:36 UTC

Modifying a Parquet file in an Iceberg table

Hi,

I understand that files in Iceberg tables are immutable. However, one of
our use-cases requires modifying a Parquet file belonging to an Iceberg
table, and I am trying to figure out how to support this.

Would an Iceberg transaction that first deletes the file and then adds it
back work?

The spec contains the following:

   1. Technically, data files can be deleted when the last snapshot that
   contains the file as “live” data is garbage collected. But this is harder
   to detect and requires finding the diff of multiple snapshots. It is easier
   to track what files are deleted in a snapshot and delete them when that
   snapshot expires.

From the above, it looks like deleting a file and adding it back in two
separate transactions will not work: the file can be garbage collected when
the snapshot created by the delete transaction expires.
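
The concern above can be sketched with a toy model. This is only an illustration of the rule quoted from the spec (files deleted in a snapshot are physically removed when that snapshot expires); it is not Iceberg's implementation, and `ToyTable`, `commit`, `expireOldest`, `isLive`, and `exists` are hypothetical names, not Iceberg APIs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: hypothetical names, not Iceberg's implementation or API. */
class ToyTable {
    /** One committed snapshot, remembering the paths it logically deleted. */
    private static final class Snapshot {
        final Set<String> deleted;

        Snapshot(Set<String> deleted) {
            this.deleted = deleted;
        }
    }

    private final Set<String> storage = new HashSet<>(); // files physically present
    private final Set<String> live = new HashSet<>();    // files in the current table state
    private final List<Snapshot> snapshots = new ArrayList<>();

    /** Commits one snapshot that logically deletes and adds the given paths. */
    void commit(Set<String> deletes, Set<String> adds) {
        live.removeAll(deletes);
        live.addAll(adds);
        storage.addAll(adds); // an added file is assumed to be written to storage
        snapshots.add(new Snapshot(new HashSet<>(deletes)));
    }

    /** Expires the oldest snapshot and physically removes the files it deleted. */
    void expireOldest() {
        Snapshot expired = snapshots.remove(0);
        storage.removeAll(expired.deleted);
    }

    /** True if the path is part of the current table state. */
    boolean isLive(String path) {
        return live.contains(path);
    }

    /** True if the path is still physically present in storage. */
    boolean exists(String path) {
        return storage.contains(path);
    }
}
```

In this model, deleting `a.parquet` in one commit and re-adding the same path in a later commit leaves the path live but physically gone once the deleting snapshot expires. Replacing it with a new path (for example a hypothetical `a-v2.parquet`) in a single commit leaves the new file untouched when the old one is cleaned up.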

Is there a way to delete a file and add it back in the same transaction?

Thanks
Vivek

Re: Modifying a Parquet file in an Iceberg table

Posted by Russell Spitzer <ru...@gmail.com>.
Modifications are usually done (see the overwrite code) using a single transaction which deletes the old file and creates the new one. 

https://github.com/apache/iceberg/blob/425b10cea34eef11cca9cf0d237e02274f6dc958/core/src/test/java/org/apache/iceberg/TestTransaction.java#L95-L122

But I’m not sure this would help you. Iceberg deletes are not physical deletes: the file would not be physically removed from the machine, so I’m not sure when you would actually swap the underlying data file.

I would also be very interested to know why it has to be the same file name. It may help to give a broader explanation of your use case, because my gut says this is probably not something you really want to be doing.
