Posted to dev@iceberg.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2019/07/03 00:29:41 UTC

Re: Updates/Deletes/Upserts in Iceberg

Sorry I didn't get back to this thread last week. Let's try to have a video
call to sync up on this next week. What days would work for everyone?

rb

On Fri, Jun 21, 2019 at 9:06 AM Erik Wright <er...@shopify.com> wrote:

> With regard to the operation values, currently they are:
>
>    - append: data files were added and no files were removed.
>    - replace: data files were rewritten with the same data; e.g.,
>    compaction, changing the data file format, or relocating data files.
>    - overwrite: data files were deleted and added in a logical overwrite
>    operation.
>    - delete: data files were removed and their contents logically deleted.
>
> If deletion files (with or without data files) are appended to the
> dataset, will we consider that an `append` operation? If so, if deletion
> and/or data files are appended, and whole files are also deleted, will we
> consider that an `overwrite`?
>
> Given that the only apparent purpose of the operation field is to optimize
> snapshot expiration, the above seems to meet its needs. An incremental
> reader can also skip `replace` snapshots but no others. Once it decides to
> read a snapshot I don't think there's any difference in how it processes
> the data for append/overwrite/delete cases.
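
A minimal sketch of that incremental-read rule, in Python with an illustrative
Snapshot shape (the field names here are assumptions, not the Iceberg API):

    from dataclasses import dataclass

    @dataclass
    class Snapshot:
        snapshot_id: int
        operation: str  # "append", "replace", "overwrite", or "delete"

    def snapshots_to_read(snapshots):
        # A `replace` snapshot rewrites existing data without changing it,
        # so an incremental reader can skip it; every other operation may
        # carry new rows or new deletes and must be processed.
        return [s for s in snapshots if s.operation != "replace"]
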
>
> On Thu, Jun 20, 2019 at 8:55 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> I don’t see that we need [sequence numbers] for file/offset-deletes,
>> since they apply to a specific file. They’re not harmful, but they don’t
>> seem relevant.
>>
>> These delete files will probably contain a path and an offset and could
>> contain deletes for multiple files. In that case, the sequence number can
>> be used to eliminate delete files that don’t need to be applied to a
>> particular data file, just like the column equality deletes. Likewise, it
>> can be used to drop the delete files when there are no data files with an
>> older sequence number.
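
As a sketch of that applicability check (hypothetical field names; it assumes
equality deletes apply only to strictly older data files, while file/offset
deletes name their target files, so an equal sequence number also keeps them
paired):

    from dataclasses import dataclass

    @dataclass
    class FileEntry:
        content: str          # "data", "equality-delete", or "position-delete"
        sequence_number: int

    def delete_applies(delete_file: FileEntry, data_file: FileEntry) -> bool:
        # A delete can only affect rows written before it, so a reader can
        # skip any delete file that is not newer than the data file being
        # scanned.
        if delete_file.content == "equality-delete":
            return delete_file.sequence_number > data_file.sequence_number
        return delete_file.sequence_number >= data_file.sequence_number
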
>>
>> I don’t understand the purpose of the min sequence number, nor what the
>> “min data seq” is.
>>
>> Min sequence number would be used for pruning delete files without
>> reading all the manifests to find out if there are old data files. If no
>> manifest with data for a partition contains a file older than some sequence
>> number N, then any delete file with a sequence number < N can be removed.
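
A sketch of that maintenance rule, assuming each entry in the manifest list
carries a min_sequence_number for the data files it tracks (an assumed field,
per the discussion above), so N never requires opening a manifest file:

    def min_live_data_seq(manifest_entries):
        # N is computed from the manifest list alone.
        return min(m["min_sequence_number"] for m in manifest_entries)

    def expired_delete_files(delete_files, manifest_entries):
        n = min_live_data_seq(manifest_entries)
        # "Any delete file with a sequence number < N can be removed."
        return [d for d in delete_files if d["sequence_number"] < n]

    manifests = [{"min_sequence_number": 7}, {"min_sequence_number": 4}]
    deletes = [{"sequence_number": 2}, {"sequence_number": 5}]
    assert expired_delete_files(deletes, manifests) == [{"sequence_number": 2}]
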
>>
> OK, so the minimum sequence number is an attribute of manifest files.
> Sounds good. It would also let us optimize compaction operations (i.e., you
> can easily limit the operation to a subset of manifest files as long as they
> are the oldest ones).
>
>
>> The “min data seq” is the minimum sequence number of a data file. That
>> seems like what we actually want for the pruning I described above.
>>
> I would expect a data file (appended rows or deletions by column value) to
> have a single sequence number that applies to the whole file. Even a
> delete-by-file-and-offset file can make do with only a single sequence number
> (which must be larger than the sequence numbers of all deleted files). Why
> do we need a "minimum" data sequence per file?
>
>> Off the top of my head [supporting non-key delete] requires adding
>> additional information to the manifest file, indicating the columns that
>> are used for the deletion. Only equality would be supported; if multiple
>> columns were used, they would be combined with boolean-and. I don’t see
>> anything too tricky about it.
>>
>> Yes, exactly. I actually phrased it wrong initially. I think it would be
>> simple to extend the equality deletes to do this. We just need a way to
>> have global scope, not just partition scope.
>>
> I don't think anything special needs to be done with regards to
> scoping/partitioning of delete files. When scanning one or more data files,
> one must also consider any and all deletion files that could apply to them.
> The only way to prune deletion files from consideration is:
>
>    1. All of your data files have at least one partition column in common.
>    2. The deletion file is also partitioned on that column (at least).
>    3. The value sets of the data files do not overlap the value sets of
>    the deletion files in that column.
>
> So given a dataset of sessions that is partitioned by device form factor
> and date, for example, you could have a delete (user_id=9876) in a deletion
> file that is not partitioned, and it would be "in scope" for all of those
> data files.
>
> If you had the same dataset partitioned by hash(user_id) and your deletes
> were _also_ partitioned by hash(user_id) you would be able to prune those
> deletes while scanning the sessions.
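
A sketch of that scoping rule over the sessions example (illustrative
structures; partitions are reduced to a dict of column -> value per file):

    from dataclasses import dataclass, field

    @dataclass
    class FileRef:
        partition: dict = field(default_factory=dict)

    def may_apply(delete_file: FileRef, data_file: FileRef) -> bool:
        # A delete file can be pruned only when it shares a partition column
        # with the data file and the values for that column differ; an
        # unpartitioned ("global") delete file is in scope for every file.
        shared = delete_file.partition.keys() & data_file.partition.keys()
        return all(delete_file.partition[c] == data_file.partition[c]
                   for c in shared)

    # Sessions bucketed by hash(user_id) as well as form factor and date:
    # the global user_id delete stays in scope, while a delete bucketed the
    # same way can be pruned when the buckets differ.
    session = FileRef({"form_factor": "phone", "date": "2019-06-21",
                       "user_bucket": 1})
    assert may_apply(FileRef({}), session)                      # user_id=9876
    assert not may_apply(FileRef({"user_bucket": 3}), session)
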
>
>> If we add this on a per-deletion file basis it is not clear if there is
>> any relevance in preserving the concept of a unique row ID.
>>
>> Agreed. That’s why I’ve been steering us away from the debate about
>> whether keys are unique or not. Either way, a natural key delete must
>> delete all of the records it matches.
>>
>> I would assume that the maximum sequence number should appear in the
>> table metadata
>>
>> Agreed.
>>
>> [W]ould you make it optional to assign a sequence number to a snapshot?
>> “Replace” snapshots would not need one.
>>
>> The only requirement is that it is monotonically increasing. If one isn’t
>> used, we don’t have to increment. I’d say it is up to the implementation to
>> decide. I would probably increment it every time to avoid errors.
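
A sketch of that commit-side policy, assuming the table metadata tracks a
last_sequence_number as discussed above (names are illustrative only):

    import itertools
    from dataclasses import dataclass, field

    _ids = itertools.count(1)

    @dataclass
    class TableMetadata:
        last_sequence_number: int = 0
        snapshots: list = field(default_factory=list)

    def commit_snapshot(meta: TableMetadata, operation: str) -> dict:
        # The only requirement is that sequence numbers are monotonically
        # increasing; incrementing on every commit, even a `replace`, is
        # the simplest policy and avoids ordering errors.
        meta.last_sequence_number += 1
        snapshot = {"snapshot_id": next(_ids),
                    "operation": operation,
                    "sequence_number": meta.last_sequence_number}
        meta.snapshots.append(snapshot)
        return snapshot
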
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Updates/Deletes/Upserts in Iceberg

Posted by Erik Wright <er...@shopify.com.INVALID>.
I have a new version
<https://docs.google.com/document/d/1FMKh_SQ6xSUUmoCA8LerTkzIxDUN5JbStQp5Hzot4eo/edit?usp=sharing>
of the proposal, updated to reflect our discussion.

I have misgivings about the elimination of the unique row ID. It makes
reading the dataset potentially much more expensive. We eliminated it in
order to support the idea of "global" deletes but might want to revisit
whether we need to handle that use case now. If we do, we may want to
handle it in a way that is distinct from the hopefully more common case of
deleting by unique row ID.

Re: Updates/Deletes/Upserts in Iceberg

Posted by Owen O'Malley <ow...@gmail.com>.
It works for me too. 

.. Owen

Re: Updates/Deletes/Upserts in Iceberg

Posted by Anton Okolnychyi <ao...@apple.com.INVALID>.
Works for me too.

Re: Updates/Deletes/Upserts in Iceberg

Posted by Erik Wright <er...@shopify.com.INVALID>.
That works for me.

Re: Updates/Deletes/Upserts in Iceberg

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
How about 9AM PDT on Friday, 5 July then?

-- 
Ryan Blue
Software Engineer
Netflix

Re: Updates/Deletes/Upserts in Iceberg

Posted by Owen O'Malley <ow...@gmail.com>.
I'd like to call in, but I'm out Thursday. Friday would work except 11am to
1pm pdt.

.. Owen

Re: Updates/Deletes/Upserts in Iceberg

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I'm available Thursday and Friday this week as well, but it's a holiday in
the US so some people may be out. If there are no objections from anyone
that would like to attend, then I'm up for that.

-- 
Ryan Blue
Software Engineer
Netflix

Re: Updates/Deletes/Upserts in Iceberg

Posted by Anton Okolnychyi <ao...@apple.com.INVALID>.
I apologize for the delay on my side. I still have to go through the latest emails. I am available on Thursday/Friday this week, and it would be great to sync.

Thanks,
Anton
