Posted to dev@iceberg.apache.org by Vivekanand Vellanki <vi...@dremio.com> on 2021/05/16 09:52:57 UTC

Querying older versions of an Iceberg table

Hi,

I would like to understand if Iceberg supports the following scenario:

   - At time t1, there's a table with a file f1.parquet
   - At time t2, f1.parquet is removed from the table. f1.parquet is also
   deleted from the filesystem
   - Querying table@t1 results in errors since f1.parquet is no longer
   available in the filesystem
   - At time t3, f1.parquet is recreated and added back to the table
   - Querying table@t1 now results in potentially incorrect results since
   f1.parquet is now present in the filesystem

Should there be a version identifier for each data-file in the manifest
file to handle such scenarios?

Thanks
Vivek
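
For concreteness, "querying table@t1" corresponds to Iceberg's
snapshot-based time travel. Below is a minimal Java sketch of such a read,
assuming a Hadoop path-based table; the location and timestamp are
placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.hadoop.HadoopTables;
    import org.apache.iceberg.io.CloseableIterable;

    public class TimeTravelRead {
      public static void main(String[] args) throws Exception {
        Table table = new HadoopTables(new Configuration())
            .load("hdfs://warehouse/db/events");

        long t1Millis = 1621152000000L;  // "t1" in the scenario above

        // asOfTime resolves t1 to the snapshot that was current then.
        // Planning lists only the files recorded in that snapshot's
        // manifests; Iceberg does not re-check the filesystem, which is why
        // a file recreated at the same path (the t3 step) is silently
        // picked up again.
        try (CloseableIterable<FileScanTask> tasks =
            table.newScan().asOfTime(t1Millis).planFiles()) {
          for (FileScanTask task : tasks) {
            System.out.println(task.file().path() + ", "
                + task.file().fileSizeInBytes() + " bytes");
          }
        }
      }
    }

As the comment notes, manifests pin paths and stats but not file contents,
which is exactly the gap the question is about.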

Re: Querying older versions of an Iceberg table

Posted by Vivekanand Vellanki <vi...@dremio.com>.
>
> Maybe the user does want to change a row in a file directly and replace
> the file to get an updated result quickly bypassing the Iceberg API
>

Actually, the above is one of the reasons our customers overwrite Parquet
files. They discover that a Parquet file contains incorrect data; they fix
it by creating a new Parquet file with the corrected data and replace the
old version of the file with the new one, bypassing Iceberg completely.

On Mon, May 17, 2021 at 9:39 AM Jack Ye <ye...@gmail.com> wrote:

> I actually think there is an argument against that use case of returning
> an error after time t3. Maybe the user does want to change a row in a file
> directly and replace the file to get an updated result quickly bypassing
> the Iceberg API. In that case failing that query after t3 would block that
> use case. The statistics in the manifest might be wrong, but we can further
> argue that the user can directly modify statistics and replace files all
> the way up to the snapshot to make sure everything continues to work.
>
> In general, if a user decides to bypass the contract set by Iceberg, I
> believe that we should not predict the behavior and compensate the system
> for that behavior, because users can bypass the contract in all different
> ways, and it would open the door to many awkward use cases and in the end
> break fundamental assumptions.
>
> In the case you described, I think the existing Iceberg behavior makes
> total sense. If you would like to achieve what you described later, you can
> potentially update your FileIO and leverage the versioning feature of the
> underlying storage to make sure that an uploaded file never has the same
> identifier, so that users cannot replace a file at t3. For example, if you
> are running on S3, you can enable S3 versioning and extend the S3FileIO so
> that each file path is not just the s3 path, but the s3 path + version.
>
> But this is just what I think, let's see how others reply.
>
> -Jack
>
> On Sun, May 16, 2021 at 8:52 PM Vivekanand Vellanki <vi...@dremio.com>
> wrote:
>
>> From an Iceberg perspective, I understand what you are saying.
>>
>> A lot of our customers add/remove files to the table using scripts. The
>> typical workflow would be:
>> - Create Parquet files using other tools
>> - Add these files to the Iceberg table
>>
>> Similarly, for removing Parquet files from the table. I understand that
>> Iceberg doesn't delete the data file until all snapshots that refer to that
>> data file expire. However, the customer can delete the file directly - they
>> might understand that a query on a snapshot will fail.
>>
>> I am concerned that an unintentional mistake in updating the Iceberg
>> table results in incorrect results while querying an Iceberg snapshot. It
>> is ok to return an error when a file referred to by a snapshot does not
>> exist.
>>
>> This issue can be addressed by adding a version identifier (e.g. mtime)
>> in the DataFile object and including this information in the manifest file.
>> This ensures that snapshot reads are correct even when users make mistakes
>> while adding/removing files to the table.
>>
>> We can work on this, if there is sufficient interest.
>>
>> On Sun, May 16, 2021 at 8:34 PM <ru...@gmail.com> wrote:
>>
>>> In the real system each file would have a unique universal identifier.
>>> When Iceberg does a delete it doesn't actually remove the file; it creates
>>> a new metadata file which no longer includes that file. When you attempt
>>> to access the table at time t1 you are actually just reading the first
>>> metadata file, not the new metadata file which is missing the entry for
>>> the deleted file.
>>>
>>> The only way to end up in the scenario you describe is if you were
>>> manually deleting files and adding files using the Iceberg internal API
>>> and not something like Spark or Flink.
>>>
>>> What actually happens is something like:
>>> T1: metadata says f1-uuid exists
>>>
>>> The data is deleted
>>> T2: metadata no longer lists f1
>>>
>>> New data is written
>>> T3: metadata says f3-uuid now exists
>>>
>>> Data files are only physically deleted by Iceberg through the expire
>>> snapshots command. This removes the snapshot metadata as well as any data
>>> files which are only referred to by those snapshots that are expired.
>>>
>>> If you are using the internal API (org.apache.iceberg.Table) then it is
>>> your responsibility not to perform operations or delete files that would
>>> violate the uniqueness of each snapshot. In this case you would similarly
>>> solve the problem by just not physically deleting the file when you remove
>>> it, although usually having unique names every time you add data is a good
>>> safety measure.
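
To make the expire-snapshots point above concrete, this is a minimal
sketch of the call on the public Table API; the table location and the
retention window are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.hadoop.HadoopTables;

    public class ExpireOldSnapshots {
      public static void main(String[] args) {
        Table table = new HadoopTables(new Configuration())
            .load("hdfs://warehouse/db/events");

        // Keep roughly one week of history.
        long cutoffMillis = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000;

        // Drops snapshot metadata older than the cutoff and physically
        // deletes only those data files no longer referenced by any
        // surviving snapshot.
        table.expireSnapshots()
            .expireOlderThan(cutoffMillis)
            .commit();
      }
    }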

Re: Querying older versions of an Iceberg table

Posted by ru...@gmail.com.
Part of the encryption proposal does include adding a data validation checksum or the like. That could potentially work, but you would still end up with lots of weird errors if users are replacing files and deleting things outside of the Iceberg API.

Sent from my iPhone

> On May 16, 2021, at 11:23 PM, Vivekanand Vellanki <vi...@dremio.com> wrote:
> 
>
> The manifest files do not have a provision for specifying the version of a data file. So, it is not really possible to read "s3 path + version" because the manifest files do not have the version information for the data file.
> 
> https://iceberg.apache.org/spec/#manifests
> 

Re: Querying older versions of an Iceberg table

Posted by Ryan Blue <bl...@apache.org>.
It sounds like the fundamental problem here is that users expect to be able
to alter files underneath the table directly. That's not a good idea
because of these issues. Iceberg is designed to make it so that users don't
need to manage the files underneath tables directly and I think we should
operate under the assumption that they don't make direct changes in the
file system -- you wouldn't expect any other database to work if you
swapped data files underneath them, either.

To me, this sounds like a problem of user assumptions (that they can change
files directly) and tooling. I'd much rather fix the tooling to write
directly to the Iceberg table or build better patterns for importing data
files. For example, we could build a service that watches a directory and
runs the equivalent of LOAD DATA INFILE for anything dropped there. That
gives similar convenience, but avoids the issue of people expecting to be
able to change their table's files.
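
As a sketch of the import step such a service could perform with the
existing append API (all paths and counts below are placeholders, and the
table is assumed to be unpartitioned; a partitioned table would also need
withPartition):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.DataFiles;
    import org.apache.iceberg.FileFormat;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.hadoop.HadoopTables;

    public class ImportDroppedFile {
      public static void main(String[] args) {
        Table table = new HadoopTables(new Configuration())
            .load("hdfs://warehouse/db/events");

        // Describe the externally written Parquet file. A real service
        // would read the size and row count from the file footer so that
        // the manifest stats stay truthful.
        DataFile dataFile = DataFiles.builder(table.spec())
            .withPath("hdfs://warehouse/db/events/data/f1.parquet")
            .withFormat(FileFormat.PARQUET)
            .withFileSizeInBytes(1_048_576L)
            .withRecordCount(10_000L)
            .build();

        // One atomic commit; from here on, removal should also go through
        // the table API rather than the filesystem.
        table.newAppend().appendFile(dataFile).commit();
      }
    }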

On Tue, May 18, 2021 at 8:57 AM Daniel Weeks <dw...@apache.org> wrote:

> Hey Vivek,
>
> I think as you can see throughout this discussion there are a number of
> issues with modifying the data files outside of Iceberg APIs.  To maintain
> data integrity, it's advised to only operate on the data through Iceberg.
>
> In many ways this is similar to trying to change the history of a single
> commit within git without changing anything before or after the commit,
> which isn't a good idea.
>
> Beyond the issue of file size (which is used by some engines for split
> calculation and may result in failures), Iceberg also tracks stats at the
> file level which are also used for filtering and metadata operations. I'd
> be more concerned that if the stats don't align with the new data file, you
> would get incorrect results (in addition to Russell's comments).
>
> I feel like the correct approach for this would be to use the APIs to
> perform a replace operation of the file(s) in question.  This doesn't fix
> your original issue of time travel, but the only alternative for that would
> be to reset back to the original snapshot and then replay all of the
> commits after.  That might be a better approach if you want to rewrite
> history, but I don't think there's an atomic way to do that at the moment.
>
> Just a few things to consider,
> -Dan

-- 
Ryan Blue

Re: Querying older versions of an Iceberg table

Posted by Daniel Weeks <dw...@apache.org>.
Hey Vivek,

I think as you can see throughout this discussion there are a number of
issues with modifying the data files outside of Iceberg APIs.  To maintain
data integrity, it's advised to only operate on the data through Iceberg.

In many ways this is similar to trying to change the history of a single
commit within git without changing anything before or after the commit,
which isn't a good idea.

Beyond the issue of file size (which is used by some engines for split
calculation and may result in failures), Iceberg also tracks stats at the
file level which are also used for filtering and metadata operations. I'd
be more concerned that if the stats don't align with the new data file, you
would get incorrect results (in addition to Russell's comments).

I feel like the correct approach for this would be to use the APIs to
perform a replace operation of the file(s) in question.  This doesn't fix
your original issue of time travel, but the only alternative for that would
be to reset back to the original snapshot and then replay all of the
commits after.  That might be a better approach if you want to rewrite
history, but I don't think there's an atomic way to do that at the moment.

Just a few things to consider,
-Dan
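
For reference, a replace through the APIs maps onto RewriteFiles, which
swaps file sets in one atomic commit. A minimal sketch follows; in
practice the old DataFile should be looked up from the table's metadata,
and the hand-built entries with placeholder stats here are only to keep
the example short:

    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.DataFiles;
    import org.apache.iceberg.FileFormat;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.hadoop.HadoopTables;

    public class ReplaceCorrectedFile {
      public static void main(String[] args) {
        Table table = new HadoopTables(new Configuration())
            .load("hdfs://warehouse/db/events");

        DataFile oldFile = DataFiles.builder(table.spec())
            .withPath("hdfs://warehouse/db/events/data/f1.parquet")
            .withFormat(FileFormat.PARQUET)
            .withFileSizeInBytes(1_048_576L)
            .withRecordCount(10_000L)
            .build();

        DataFile correctedFile = DataFiles.builder(table.spec())
            .withPath("hdfs://warehouse/db/events/data/f1-corrected.parquet")
            .withFormat(FileFormat.PARQUET)
            .withFileSizeInBytes(1_050_000L)
            .withRecordCount(10_000L)
            .build();

        // One new snapshot with the swap applied. Older snapshots still
        // point at f1.parquet, so this rewrites the present, not history.
        table.newRewrite()
            .rewriteFiles(Collections.singleton(oldFile),
                Collections.singleton(correctedFile))
            .commit();
      }
    }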

On Sun, May 16, 2021 at 10:32 PM Vivekanand Vellanki <vi...@dremio.com>
wrote:

> Thinking about this more, if a Parquet file is deleted from the table and
> added back (both transactions using Iceberg APIs), this will lead to issues
> with snapshot reads on the table.
>
> If the fileSize changes, I believe the snapshot reads will fail (depends
> on how the reader handles these failures).
>
> If the fileSize remains unchanged but the data changed, the snapshot query
> now returns different data.

Re: Querying older versions of an Iceberg table

Posted by Vivekanand Vellanki <vi...@dremio.com>.
Thinking about this more, if a Parquet file is deleted from the table and
added back (both transactions using Iceberg APIs), this will lead to issues
with snapshot reads on the table.

If the fileSize changes, I believe the snapshot reads will fail (depends on
how the reader handles these failures).

If the fileSize remains unchanged but the data changed, the snapshot query
now returns different data.
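
One way to surface both failure modes early would be a verification pass
comparing the manifest-recorded size of each file in a snapshot against
what storage reports. The sketch below is an illustration, not an existing
Iceberg check; the table location and snapshot id are placeholders, and a
same-size content swap would still slip through:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.hadoop.HadoopTables;
    import org.apache.iceberg.io.CloseableIterable;

    public class VerifySnapshotFiles {
      public static void main(String[] args) throws Exception {
        Table table = new HadoopTables(new Configuration())
            .load("hdfs://warehouse/db/events");
        long snapshotId = 123456789L;  // placeholder snapshot to verify

        try (CloseableIterable<FileScanTask> tasks =
            table.newScan().useSnapshot(snapshotId).planFiles()) {
          for (FileScanTask task : tasks) {
            DataFile file = task.file();
            long expected = file.fileSizeInBytes();  // size in the manifest
            long actual = table.io()
                .newInputFile(file.path().toString()).getLength();
            if (expected != actual) {
              System.err.println("Out-of-band change to " + file.path()
                  + ": manifest says " + expected
                  + " bytes, storage says " + actual);
            }
          }
        }
      }
    }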


Re: Querying older versions of an Iceberg table

Posted by Vivekanand Vellanki <vi...@dremio.com>.
On Mon, May 17, 2021 at 10:20 AM Jack Ye <ye...@gmail.com> wrote:

> > Actually, the above is one of the reasons our customers overwrite
> > Parquet files. They discover that a Parquet file contains incorrect data;
> > they fix it by creating a new Parquet file with the corrected data and
> > replace the old version of the file with the new one, bypassing Iceberg
> > completely.
>
> If that is the case, I am a bit confused: isn't the current Iceberg
> behavior what they need? Do you want to prevent this from happening?
>

Changing files without informing Iceberg will cause issues. I assume that
the Parquet/ORC readers take the fileSize to be accurate and read the
footer/tail on that basis.

If files are modified, Iceberg should be informed. My concern is that even
when Iceberg is informed, snapshot reads will return incorrect results. I
believe this should result in an error since snapshots should be immutable
and queries on snapshots should return the same results.

> > The manifest files do not have a provision for specifying the version of
> > a data file. So, it is not really possible to read "s3 path + version"
> > because the manifest files do not have the version information for the
> > data file.
>
> What I mean is that you can define your file location as s3 path +
> version, for example: s3://my-bucket/my/path/to/file?version=123456. If
> this value is recorded as the location in DataFile and your FileIO can
> parse this location, then there is no way to replace the file in that given
> location.
>
> > Part of the encryption proposal does include adding a data validation
> > checksum or the like.
>
> Yes, Russell made a good point: the proposed encryption support would make
> the process of replacing a file very hard if authenticated encryption is
> enabled (although still possible to decrypt and reuse the same AAD if the
> user is a super user). You can read more about it here
> <https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4>
>

Thanks, I will take a look at the proposal.

I am looking for something that is supported by Iceberg natively.

> -Jack

Re: Querying older versions of an Iceberg table

Posted by Jack Ye <ye...@gmail.com>.
> Actually, the above is one of the reasons our customers overwrite Parquet
> files. They discover that a Parquet file contains incorrect data; they fix
> it by creating a new Parquet file with the corrected data and replace the
> old version of the file with the new one, bypassing Iceberg completely.

If that is the case, I am a bit confused: isn't the current Iceberg
behavior what they need? Do you want to prevent this from happening?

> The manifest files do not have a provision for specifying the version of
> a data file. So, it is not really possible to read "s3 path + version"
> because the manifest files do not have the version information for the data
> file.

What I mean is that you can define your file location as s3 path + version,
for example: s3://my-bucket/my/path/to/file?version=123456. If this value
is recorded as the location in DataFile and your FileIO can parse this
location, then there is no way to replace the file in that given location.
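
To sketch that idea, a delegating FileIO could split such a location into
path and version before calling a real FileIO like S3FileIO. The class
below and its "?version=" convention are hypothetical, not an existing
Iceberg feature, and the actual version-pinned read is only described in
comments:

    import org.apache.iceberg.io.FileIO;
    import org.apache.iceberg.io.InputFile;
    import org.apache.iceberg.io.OutputFile;

    public class VersionedLocationFileIO implements FileIO {
      private final FileIO delegate;  // e.g. a separately configured S3FileIO

      public VersionedLocationFileIO(FileIO delegate) {
        this.delegate = delegate;
      }

      // s3://bucket/path/file.parquet?version=123456 -> s3://bucket/path/file.parquet
      static String pathOf(String location) {
        int q = location.indexOf("?version=");
        return q < 0 ? location : location.substring(0, q);
      }

      // s3://bucket/path/file.parquet?version=123456 -> 123456 (null if absent)
      static String versionOf(String location) {
        int q = location.indexOf("?version=");
        return q < 0 ? null : location.substring(q + "?version=".length());
      }

      @Override
      public InputFile newInputFile(String location) {
        // A real implementation would use versionOf(location) to return an
        // InputFile whose stream issues a versioned read (AWS SDK v2:
        // GetObjectRequest.builder().versionId(...)), so overwriting the
        // path at t3 could not change what an old snapshot reads.
        return delegate.newInputFile(pathOf(location));
      }

      @Override
      public OutputFile newOutputFile(String location) {
        return delegate.newOutputFile(pathOf(location));
      }

      @Override
      public void deleteFile(String location) {
        delegate.deleteFile(pathOf(location));
      }
    }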

> Part of the encryption proposal does include adding a data validation
> checksum or the like.

Yes, Russell made a good point: the proposed encryption support would make
the process of replacing a file very hard if authenticated encryption is
enabled (although still possible to decrypt and reuse the same AAD if the
user is a super user). You can read more about it here
<https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4>

-Jack


On Sun, May 16, 2021 at 9:38 PM <ru...@gmail.com> wrote:

> Part of the encryption proposal does include adding a data validation
> checksum or the like. That could potentially work to prevent corrupted
> requests since you would get an error that the file does not match the
> checksum but would still end up with lots of weird errors if users are
> replacing files and deleting things outside of the iceberg api.
>
> Sent from my iPhone

Re: Querying older versions of an Iceberg table

Posted by ru...@gmail.com.
Part of the encryption proposal does include adding a data validation checksum or the like. That could potentially work to prevent this kind of silent corruption, since you would get an error that the file does not match the checksum, but you would still end up with lots of weird errors if users are replacing files and deleting things outside of the Iceberg API.
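
A minimal sketch of what such a check could look like at read time, assuming a hypothetical expectedSha256 value recorded per data file (today's manifests carry no such field); HexFormat needs Java 17+:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.util.HexFormat;

    public class ChecksumCheck {
      // expectedSha256 is hypothetical: it would come from the manifest entry.
      static void verify(Path file, String expectedSha256) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
          byte[] buf = new byte[8192];
          int n;
          while ((n = in.read(buf)) != -1) {
            digest.update(buf, 0, n);
          }
        }
        String actual = HexFormat.of().formatHex(digest.digest());
        if (!actual.equalsIgnoreCase(expectedSha256)) {
          throw new IllegalStateException("checksum mismatch, file was replaced: " + file);
        }
      }
    }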


Re: Querying older versions of an Iceberg table

Posted by Vivekanand Vellanki <vi...@dremio.com>.
The manifest files have no provision for recording the version of a data
file. So it is not really possible to read "s3 path + version": the
manifests simply do not carry version information for the data file.

https://iceberg.apache.org/spec/#manifests

Re: Querying older versions of an Iceberg table

Posted by Jack Ye <ye...@gmail.com>.
I actually think there is an argument against that use case of returning an
error after time t3. Maybe the user does want to change a row in a file
directly and replace the file to get an updated result quickly, bypassing
the Iceberg API. In that case failing that query after t3 would block that
use case. The statistics in the manifest might be wrong, but we can further
argue that the user can directly modify statistics and replace files all
the way up to the snapshot to make sure everything continues to work.

In general, if a user decides to bypass the contract set by Iceberg, I
believe that we should not predict the behavior and compensate the system
for that behavior, because users can bypass the contract in all different
ways, and doing so would open the door to many awkward use cases and in the
end break the fundamental assumptions of the system.

In this case you described, I think the existing Iceberg behavior makes
total sense. If you would like to achieve what you described later, you can
potentially update your FileIO and leverage the versioning feature of the
underlying storage to make sure that an uploaded file never has the same
identifier, so that users cannot replace a file at t3. For example, if you
are running on S3, you can enable S3 versioning and extend the S3FileIO so
that each file path is not just the s3 path, but the s3 path + version.
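
As a sketch of that first step, enabling bucket versioning with the AWS SDK
v2 (the bucket name is a placeholder):

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.BucketVersioningStatus;
    import software.amazon.awssdk.services.s3.model.PutBucketVersioningRequest;
    import software.amazon.awssdk.services.s3.model.VersioningConfiguration;

    public class EnableVersioning {
      public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
          s3.putBucketVersioning(PutBucketVersioningRequest.builder()
              .bucket("my-bucket") // placeholder bucket name
              .versioningConfiguration(VersioningConfiguration.builder()
                  .status(BucketVersioningStatus.ENABLED)
                  .build())
              .build());
        }
      }
    }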

But this is just what I think; let's see how others reply.

-Jack


Re: Querying older versions of an Iceberg table

Posted by Vivekanand Vellanki <vi...@dremio.com>.
From an Iceberg perspective, I understand what you are saying.

A lot of our customers add/remove files to the table using scripts. The
typical workflow would be:
- Create Parquet files using other tools
- Add these files to the Iceberg table (a sketch of this step follows below)
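
A sketch of that second step through Iceberg's internal API, assuming an
unpartitioned table and file stats gathered by the script:

    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.DataFiles;
    import org.apache.iceberg.FileFormat;
    import org.apache.iceberg.Table;

    public class AddFile {
      // Registers an already-written Parquet file with the table.
      // Assumes an unpartitioned table; size and row count come from the script.
      static void addParquet(Table table, String location, long sizeInBytes, long rowCount) {
        DataFile file = DataFiles.builder(table.spec())
            .withPath(location)
            .withFormat(FileFormat.PARQUET)
            .withFileSizeInBytes(sizeInBytes)
            .withRecordCount(rowCount)
            .build();
        table.newAppend().appendFile(file).commit();
      }
    }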

The workflow is similar for removing Parquet files from the table. I
understand that Iceberg doesn't delete the data file until all snapshots
that refer to that data file expire. However, the customer can delete the
file directly - they may well understand that a query on an older snapshot
will then fail.

I am concerned that an unintentional mistake in updating the Iceberg table
can silently produce incorrect results when querying an Iceberg snapshot. It
is ok to return an error when a file referred to by a snapshot does not
exist.

This issue can be addressed by adding a version identifier (e.g. mtime) in
the DataFile object and including this information in the manifest file.
This ensures that snapshot reads are correct even when users make mistakes
while adding/removing files to the table.
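
A rough sketch of how a reader could enforce such a check, assuming a
hypothetical recordedMtime field persisted in the manifest (the Hadoop
FileSystem API supplies the modification time):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MtimeGuard {
      // recordedMtime is hypothetical: it would be a new field in the manifest.
      static void checkUnchanged(String location, long recordedMtime) throws Exception {
        Path path = new Path(location);
        FileSystem fs = path.getFileSystem(new Configuration());
        FileStatus status = fs.getFileStatus(path);
        if (status.getModificationTime() != recordedMtime) {
          throw new IllegalStateException(
              "data file " + location + " changed after commit; failing the snapshot read");
        }
      }
    }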

We can work on this, if there is sufficient interest.


Re: Querying older versions of an Iceberg table

Posted by ru...@gmail.com.
In the real system each file would have a unique universal identifier. When Iceberg does a delete it doesn't actually remove the file; it creates a new metadata file which no longer includes that file. When you attempt to access the table at time t1 you are actually just reading the first metadata file, not the new metadata file which is missing the entry for the deleted file.
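
For reference, this is how a scan is pinned to an older state through the current API; it only consults the metadata that was current at that time (sketch, assuming a loaded Table):

    import org.apache.iceberg.Table;
    import org.apache.iceberg.TableScan;

    public class TimeTravel {
      // Plans a scan against the snapshot that was current at timestampMillis;
      // the scan never consults metadata written after that point.
      static TableScan scanAsOf(Table table, long timestampMillis) {
        return table.newScan().asOfTime(timestampMillis);
      }
    }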

The only way to end up in the scenario you describe is if you were manually deleting files and adding files using the Iceberg internal API and not something like Spark or Flink.

What actually happens is something like:
T1: metadata says f1-uuid exists

The data is deleted
T2: metadata no longer lists f1

New data is written
T3: metadata says f3-uuid now exists

Data files are only physically deleted by Iceberg through the expire snapshots command. This removes the snapshot metadata as well as any data files which are only referred to by the snapshots that are expired.
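
A sketch of that cleanup through the public API, keeping for example the last seven days of snapshots (the cutoff is arbitrary):

    import java.util.concurrent.TimeUnit;
    import org.apache.iceberg.Table;

    public class Expire {
      // Removes snapshots older than seven days, along with data files
      // referenced only by those snapshots.
      static void expireOld(Table table) {
        long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
        table.expireSnapshots().expireOlderThan(cutoff).commit();
      }
    }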

If you are using the internal API (org.apache.iceberg.Table) then it is your responsibility not to perform operations or delete files that would violate the uniqueness of each snapshot. In this case you would similarly solve the problem by just not physically deleting the file when you remove it. Usually, having unique names every time you add data is a good safety measure anyway.
