Posted to dev@iceberg.apache.org by Jack Ye <ye...@gmail.com> on 2022/02/01 00:50:47 UTC

Re: Continuing the Secondary Index Discussion

Hi Zaicheng, I cannot see your pictures, maybe we could discuss in Slack.

The goal here is to have a monotonically increasing number that could be
used to detect which files have been newly added and should be indexed. This
is especially important for knowing how up-to-date an index is for each
partition.

In a table without compaction, the sequence numbers of files continue to
increase. If we have indexed all files up to sequence number 3, we know
that the next indexing process needs to index all files with a sequence
number greater than 3. But during compaction, files are rewritten with
the starting sequence number, and by commit time the table's sequence
number might already have gone much higher. For example, I start compaction
at seq=3, and while it runs for a few hours, 10 inserts are done to the
table, bringing the current sequence number to 13. When I commit the
compacted data files, those files are essentially written at a sequence
number older than the latest. This breaks a lot of assumptions: (1) I
cannot find new data to index simply by checking whether the sequence
number is higher than a certain value, and (2) a reader cannot determine
whether an index can be used based on the sequence number.

The solution I was describing is to have another watermark that is
monotonically increasing regardless of compaction. So compaction would
commit those files at seq=3, but the new watermark of those files would be
14. Then we can use this new watermark for all index operations.
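
To make the idea concrete, here is a minimal sketch (the names and
structure are illustrative only, not actual Iceberg APIs) of how a
commit-time watermark lets the indexing process find new files even when
compaction commits files at an old sequence number:

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    sequence_number: int  # compacted files keep the starting sequence number
    watermark: int        # assigned at commit time, always increasing

def files_to_index(files, indexed_up_to):
    """Return paths of files the next indexing run must process."""
    # Filtering on sequence_number would miss files that compaction
    # committed "in the past"; the watermark does not have that problem.
    return [f.path for f in files if f.watermark > indexed_up_to]

# Compaction started at seq=3 but committed when the watermark reached 14;
# a normal insert landed at seq=13 in the meantime.
files = [
    DataFile("compacted.parquet", sequence_number=3, watermark=14),
    DataFile("insert.parquet", sequence_number=13, watermark=13),
]
print(files_to_index(files, indexed_up_to=13))  # ['compacted.parquet']
```

The same comparison also answers the reader-side question: an index built
through watermark W covers exactly the files with watermark <= W.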

Best,
Jack Ye


On Sat, Jan 29, 2022 at 5:07 AM Zaicheng Wang <wc...@gmail.com>
wrote:

> Hi Jack,
>
>
> Thanks for the summary; it helps me a lot.
>
> Trying to understand point 2 and adding my 2 cents.
>
> *a mechanism for tracking file change is needed. Unfortunately sequence
> numbers cannot be used due to the introduction of compaction that rewrites
> files into a lower sequence number. Another monotonically increasing
> watermark for files has to be introduced for index change detection and
> invalidation.*
>
> Please let me know if I have some wrong/silly assumptions.
>
> So the *reason* we couldn't use sequence numbers as the validity
> indicator of the index is compaction. Before compaction (taking a very
> simple example), the data file and index file should have a mapping, and
> tableScan.planTask() is able to decide whether to use an index purely by
> comparing sequence numbers (as well as the index spec id, if we have one).
>
> After compaction, the tableScan.planTask() couldn't do so because data
> file 5 is compacted to a new data file with seq = 10. Thus wrong plan tasks
> might be returned.
>
> I wonder how an additional watermark only for the index could solve the
> problem?
>
>
> And based on my gut feeling, I feel we could somehow solve the problem
> with the current sequence number:
>
> *Option 1*: When compacting, we could compact the data files whose index
> is up to date into one group, and the files whose index is stale or missing
> into another group (just like what we do with data files that are
> unpartitioned or whose partition spec id does not match).
>
> The *pro* is that we could still leverage indexes for part of the data
> files, and we could reuse the sequence number.
>
> The *cons* are that the compaction might not reach the target size and we
> might still have small files.
>
> *Option 2*:
>
> Assuming compaction is often triggered by data engineers and the
> compaction action is not so frequent, we could directly invalidate all
> index files for the compacted data files, and the user would need to
> rebuild the index every time after compaction.
>
> *Pro*: Easy to implement, clear to understand.
>
> *Cons*: Relatively bad user experience; wastes some computing resources
> redoing work.
>
> *Option 3*:
>
> We could leverage the engine's computing resource to always rebuild
> indexes during data compaction.
>
> *Pro*: User could leverage index after the data compaction.
>
> *Cons*: Rebuilding might take more time/resources.
>
> *Option 3 alternative*: add a configuration property to compaction that
> controls whether the user wants to rebuild the index during compaction.
>
>
> Please let me know if you have any thoughts on this.
>
> Best,
>
> Zaicheng
>
> Jack Ye <ye...@gmail.com> wrote on Wed, Jan 26, 2022, 13:17:
>
>> Thanks for the fast responses!
>>
>> Based on the conversations above, it sounds like we have the following
>> consensus:
>>
>> 1. asynchronous index creation is preferred, although synchronous index
>> creation is possible.
>> 2. a mechanism for tracking file change is needed. Unfortunately sequence
>> numbers cannot be used due to the introduction of compaction that rewrites
>> files into a lower sequence number. Another monotonically increasing
>> watermark for files has to be introduced for index change detection and
>> invalidation.
>> 3. index creation and maintenance procedures should be pluggable by
>> different engines. This should not be an issue because Iceberg has been
>> designing action interfaces for different table maintenance procedures so
>> far, so what Zaicheng describes should be the natural development direction
>> once the work is started.
>>
>> Regarding index level, I also think partition level index is more
>> important, but it seems like we have to first do file level as the
>> foundation. This leads to the index storage part. I am not talking about
>> using Parquet to store it, I am asking about what Miao is describing. I
>> don't think we have a consensus around the exact place to store index
>> information yet. My memory is that there are a few ways:
>> 1. file level index stored as a binary field in the manifest, partition level
>> index stored as a binary field in the manifest list. This would only work for
>> small indexes like bitmaps (or bloom filters to a certain extent)
>> 2. some sort of binary file to store index data, with the index metadata (e.g.
>> index type) and a pointer to the binary index data file kept as in (1) (I think
>> this is what Miao is describing)
>> 3. some sort of index spec to independently store index metadata and
>> data, similar to what we are proposing today for view
>>
>> Another aspect of index storage is the index file location in cases (2)
>> and (3). In the original doc a specific file path structure is proposed,
>> which is a bit against the Iceberg standard of not assuming file path
>> structure so it can work with any storage. We also need more clarity on
>> that topic.
>>
>> Best,
>> Jack Ye
>>
>>
>> On Tue, Jan 25, 2022 at 7:02 PM Zaicheng Wang <wc...@gmail.com>
>> wrote:
>>
>>> Thanks for having the thread. This is Zaicheng from bytedance.
>>>
>>> Initially we are planning to add an index feature for our internal Trino,
>>> and we feel like Iceberg could be the best place for holding/building the
>>> index data.
>>> We are very interested in having and contributing to this feature.
>>> (Pretty new to the community, but still adding my 2 cents.)
>>>
>>> Echoing what Miao mentioned on 4): I feel Iceberg could provide
>>> interfaces for creating/updating/deleting indexes, and each engine can
>>> decide how to invoke these methods (in a distributed or single-threaded
>>> manner, async or sync).
>>> Take our use case as an example: we plan to have a new DDL syntax
>>> "create index id_1 on table col_1 using bloom"/"update index id_1 on table
>>> col_1", and our SQL engine will create distributed index creation/updating
>>> operators. Each operator will invoke the index-related methods provided by
>>> Iceberg.
>>>
>>> Storage): Does the index data have to be a file? Wondering if we want to
>>> design the index data storage interface in such a way that people can plug
>>> in different index storage (file storage / a centralized index storage
>>> service) later on.
>>>
>>> Thanks,
>>> Zaicheng
>>>
>>>
>>> Miao Wang <mi...@adobe.com.invalid> wrote on Wed, Jan 26, 2022, 10:22:
>>>
>>>> Thanks Jack for resuming the discussion. Zaicheng from Byte Dance
>>>> created a Slack channel for the index work. I suggested that he add
>>>> Anton and you to the channel.
>>>>
>>>>
>>>>
>>>> I still remember some conclusions from previous discussions.
>>>>
>>>>
>>>>
>>>> 1). Index types support: We planned to support a Skipping Index first.
>>>> Iceberg metadata exposes hints about whether the tracked data files have
>>>> an index, which reduces index-reading overhead. Index files can be applied
>>>> when generating the scan task.
>>>>
>>>>
>>>>
>>>> 2). As Ryan mentioned, the sequence number will be used to indicate
>>>> whether an index is valid. The sequence number can link data evolution
>>>> with index evolution.
>>>>
>>>>
>>>>
>>>> 3). Storage: We planned to have a simple file format which includes the
>>>> Column Name/ID, Index Type (String), index content length, and binary
>>>> content. It is not necessary to use Parquet to store the index. The initial
>>>> thought was 1 data file mapping to 1 index file; it can be merged to 1
>>>> partition mapping to 1 index file. As Ryan said, a file level implementation
>>>> could be a stepping stone for the partition level implementation.
>>>>
>>>>
>>>>
>>>> 4). How to build index: We want to keep the index reading and writing
>>>> interface with Iceberg and leave the actual building logic as Engine
>>>> specific (i.e., we can use different compute to build Index without
>>>> changing anything inside Iceberg).
>>>>
>>>>
>>>>
>>>> Misc:
>>>>
>>>> Huaxin implemented Index support API for DSv2 in Spark 3.x code base.
>>>>
>>>> Design doc:
>>>> https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit
>>>>
>>>> PR should have been merged.
>>>>
>>>> Guy from IBM did a partial PoC and provided a private doc. I will ask
>>>> if he can make it public.
>>>>
>>>>
>>>>
>>>> We can continue the discussion and break down the big tasks into
>>>> tickets.
>>>>
>>>>
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>> Miao
>>>>
>>>> *From: *Ryan Blue <bl...@tabular.io>
>>>> *Date: *Tuesday, January 25, 2022 at 5:08 PM
>>>> *To: *Iceberg Dev List <de...@iceberg.apache.org>
>>>> *Subject: *Re: Continuing the Secondary Index Discussion
>>>>
>>>> Thanks for raising this for discussion, Jack! It would be great to
>>>> start adding more indexes.
>>>>
>>>>
>>>>
>>>> > Scope of native index support
>>>>
>>>>
>>>>
>>>> The way I think about it, the biggest challenge here is how to know
>>>> when you can use an index. For example, if you have a partition index that
>>>> is up to date as of snapshot 13764091836784, but the current snapshot is
>>>> 97613097151667, then you basically have no idea what files are covered or
>>>> not and can't use it. On the other hand, if you know that the index was up
>>>> to date as of sequence number 11 and you're reading sequence number 12,
>>>> then you just have to read any data file that was written at sequence
>>>> number 12.
>>>>
>>>>
>>>>
>>>> The problem of where you can use an index makes me think that it is
>>>> best to maintain index metadata within Iceberg. An alternative is to try to
>>>> always keep the index up-to-date, but I don't think that's necessarily
>>>> possible -- you'd have to support index updates in every writer that
>>>> touches table data. You would have to spend the time updating indexes at
>>>> write time, but there are competing priorities like making data available.
>>>> So I think you want asynchronous index updates and that leads to
>>>> integration with the table format.
>>>>
>>>>
>>>>
>>>> > Index levels
>>>>
>>>>
>>>>
>>>> I think that partition-level indexes are better for job planning
>>>> (eliminate whole partitions!) but file-level are still useful for skipping
>>>> files at the task level. I would probably focus on partition-level, but I'm
>>>> not strongly opinionated here. File-level is probably a stepping stone to
>>>> partition-level, given that we would be able to track index data in the
>>>> same format.
>>>>
>>>>
>>>>
>>>> > Index storage
>>>>
>>>>
>>>>
>>>> Do you mean putting indexes in Parquet, or using Parquet for indexes? I
>>>> think that bloom filters would probably exceed the amount of data we'd want
>>>> to put into a Parquet binary column, probably at the file level and almost
>>>> certainly at the partition level, since the size depends on the number of
>>>> distinct values and the primary use is for identifiers.
>>>>
>>>>
>>>>
>>>> > Indexing process
>>>>
>>>>
>>>>
>>>> Synchronous is nice, but as I said above, I think we have to support
>>>> async because it is too complicated to update every writer that touches a
>>>> table and you may not want to pay the price at write time.
>>>>
>>>>
>>>>
>>>> > Index validation
>>>>
>>>>
>>>>
>>>> I think this is pretty much what I talked about for question 1. I think
>>>> that we have a good plan around using sequence numbers, if we want to do
>>>> this.
>>>>
>>>>
>>>>
>>>> Ryan
>>>>
>>>>
>>>>
>>>> On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <ye...@gmail.com> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>>
>>>>
>>>> Based on the conversation in the last community sync and the Iceberg
>>>> Slack channel, it seems like multiple parties have interest in continuing
>>>> the effort related to the secondary index in Iceberg, so I would like to
>>>> restart the thread to continue the discussion.
>>>>
>>>>
>>>>
>>>> So far most people refer to the document authored by Miao Wang
>>>> <https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit>
>>>> which has a lot of useful information about the design and implementation.
>>>> However, the document is also quite old (over a year now) and a lot has
>>>> changed in Iceberg since then. I think the document leaves the following
>>>> open topics that we need to continue to address:
>>>>
>>>>
>>>>
>>>> 1. *scope of native index support*: what type of index should Iceberg
>>>> support natively, how should developers allocate effort between adding
>>>> support of Iceberg native index compared to developing Iceberg support for
>>>> holistic indexing projects such as HyperSpace
>>>> <https://microsoft.github.io/hyperspace/>
>>>> .
>>>>
>>>>
>>>>
>>>> 2. *index levels*: we have talked about partition level indexing and
>>>> file level indexing. More clarity is needed for these index levels and the
>>>> level of interest and support needed for those different indexing levels.
>>>>
>>>>
>>>>
>>>> 3. *index storage*: we had unsettled debates around making the index
>>>> separate files or embedding it as a part of the existing Iceberg file
>>>> structure. We need to come up with criteria such as index size, ease of
>>>> generation during write, etc., to settle the discussion.
>>>>
>>>>
>>>>
>>>> 4. *Indexing process*: as stated in Miao's document, indexes could be
>>>> created during the data writing process synchronously, or built
>>>> asynchronously through an index service. Discussion is needed for the focus
>>>> of the Iceberg index functionalities.
>>>>
>>>>
>>>>
>>>> 5. *index invalidation*: depending on the scope and level, certain
>>>> indexes need to be invalidated during operations like RewriteFiles. Clarity
>>>> is needed in this domain, including whether we need another sequence number
>>>> to track such invalidation.
>>>>
>>>>
>>>>
>>>> I suggest we iterate a bit on this list of open questions, and then we
>>>> can have a meeting to discuss those aspects, and produce an updated
>>>> document addressing those aspects to provide a clear path forward for
>>>> developers interested in adding features in this domain.
>>>>
>>>>
>>>>
>>>> Any thoughts?
>>>>
>>>>
>>>>
>>>> Best,
>>>>
>>>> Jack Ye
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Ryan Blue
>>>>
>>>> Tabular
>>>>
>>>

The Secondary Indexes design using a new sequence number //Re: [External] Continuing the Secondary Index Discussion

Posted by leilei hu <hu...@gmail.com>.
We at Bytedance spent some time working on the secondary index invalidation design.
We put the details into a document:
https://docs.google.com/document/d/17q0pukixKR2a2BESJWykz4ENhCKSf8yQr751P2i7PF8
The document includes the statistics metadata design and secondary index invalidation using this metadata. Following @Jack Ye's idea of introducing a new sequence number, we add a new sequence number (named 'write-id') to the table format to track data and file changes.
Because the sequence number of a data file newly generated by compaction is not incremental, file-level changes cannot be tracked using the data sequence number alone; the added 'write-id' field works correctly in compaction situations.


The code for the new sequence number 'write-id' used to track table changes is in https://github.com/apache/iceberg/pull/4460 .
For example, after an insert operation, the value of the new 'write-id' in the manifest entry is "writer_id":{"long":1}. This value will not change after data compaction because data files are immutable.
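
As an illustration only (the field names follow the example above; this is
a sketch, not the PR's actual code), a reader-side coverage check against
'write-id' keeps working after compaction, where the data sequence number
would not:

```python
# Hypothetical reader-side check: an index recorded as valid through a given
# 'write-id' covers exactly the entries whose write-id is <= that value.
def index_covers(entry, index_valid_through):
    return entry["write-id"] <= index_valid_through

entries = [
    {"path": "a.parquet", "data-seq": 3, "write-id": 14},   # compacted file
    {"path": "b.parquet", "data-seq": 13, "write-id": 13},  # earlier insert
]

# Index built through write-id 13: usable for b.parquet but not for the
# compacted file, even though the compacted file has a lower data-seq.
covered = [e["path"] for e in entries if index_covers(e, 13)]
print(covered)  # ['b.parquet']
```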



Please let me know if you have any thoughts. We could also discuss this during the sync meeting. Thanks.

> On Mar 8, 2022, at 8:39 PM, Zaicheng Wang <wc...@gmail.com> wrote:
> 
> Hi PF,
> 
> Sure, I rescheduled the meeting to a CET-friendly time.
> The meeting is now scheduled for 9 AM PST, March 11th (6 PM CET, March 11th).
> The meeting link is meet.google.com/ttd-jzid-abp
> Please feel free to slack me or tag me in the slack channel if anyone would like to get a meeting invitation (or you could directly join the meeting).
> 
> Best,
> Zaicheng
> 
> 
> 
> Piotr Findeisen <piotr@starburstdata.com> wrote on Mon, Mar 7, 2022, 21:54:
> Hi Zaicheng,
> 
> thanks for following up on this. I'm certainly interested. 
> The proposed time doesn't work for me though, I'm in the CET time zone.
> 
> Best,
> PF
> 
> 
> On Sat, Mar 5, 2022 at 9:33 AM Zaicheng Wang <wcatp19891104@gmail.com> wrote:
> Hi dev folks,
> 
> As discussed in the sync <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.z3dncl7gr8m1> meeting, we will have a dedicated meeting on this topic. 
> I tentatively scheduled a meeting for 4 PM PST, March 8th. The meeting link is https://meet.google.com/ttd-jzid-abp
> Please let me know if the time does not work for you.
> 
> Thanks,
> Zaicheng
> 
> zaicheng wang <wangzaicheng@bytedance.com> wrote on Wed, Mar 2, 2022, 21:17:
> Hi folks,
> 
> This is Zaicheng from bytedance. We spent some time working on solving the index invalidation problem, as discussed in the dev email channel.
> And when we are working on the POC, we also realize there are some metadata changes that might be introduced.
> We put these details into a document:
> https://docs.google.com/document/d/1hLCKNtnA94gFKjssQpqS_2qqAxrlwqq3f6agij_4Rm4/edit?usp=sharing
> The document includes two proposals for solving the index invalidation problem: one based on @Jack Ye's idea of introducing a new sequence number, and another leveraging the current manifest entry structure. The document also describes the corresponding table spec change.
> Please let me know if you have any thoughts. We could also discuss this during the sync meeting. 
> 
> Thanks,
> Zaicheng
> 
> On Tue, Feb 1, 2022 at 8:51 AM Jack Ye <yezhaoqin@gmail.com <ma...@gmail.com>> wrote:
> Hi Zaicheng, I cannot see your pictures, maybe we could discuss in Slack.
> 
> The goal here is to have a monotonically increasing number that could be used to detect what files have been newly added and should be indexed. This is especially important to know how up-to-date an index is for each partition.
> 
> In a table without compaction, sequence number of files would continue to increase. If we have indexed all files up to sequence number 3, we know that the next indexing process needs to index all the files with sequence number greater than 3. But during compaction, files will be rewritten with the starting sequence number. During commit time the sequence number might already gone much higher. For example, I start compaction at seq=3, and when this is running for a few hours, there are 10 inserts done to the table, and the current sequence number is 13. When I commit the compacted data files, those files are essentially written to a sequence number older than the latest. This breaks a lot of assumption like (1) I cannot just find new data to index by calculating if the sequence number is higher than certain value, (2) a reader cannot determine if an index could be used based on the sequence number.
> 
> The solution I was describing is to have another watermark that is monotonically increasing regardless of compaction or not. So Compaction would commit those files at seq=3, but the new watermark of those files are at 14. Then we can use this new watermark for all the index operations.
> 
> Best,
> Jack Ye
> 
> 
> On Sat, Jan 29, 2022 at 5:07 AM Zaicheng Wang <wcatp19891104@gmail.com <ma...@gmail.com>> wrote:
> Hi Jack,
> 
> 
> 
> Thanks for the summary and it helps me a lot.
> 
> Trying to understand point 2 and having my 2 cents.
> 
> a mechanism for tracking file change is needed. Unfortunately sequence numbers cannot be used due to the introduction of compaction that rewrites files into a lower sequence number. Another monotonically increasing watermark for files has to be introduced for index change detection and invalidation.
> 
> Please let me know if I have some wrong/silly assumptions.
> 
> So the reason we couldn't use sequence numbers as the validness indicator of the index is compaction. Before compaction (taking a very simple example), the data file and index file should have a mapping and the tableScan.planTask() is able to decide whether to use index purely by comparing sequence numbers (as well as index spec id, if we have one).
> 
> 
> After compaction, the tableScan.planTask() couldn't do so because data file 5 is compacted to a new data file with seq = 10. Thus wrong plan tasks might be returned.
> 
> 
> I wonder how an additional watermark only for the index could solve the problem?
> 
> 
> 
> And based on my gut feeling, I feel we could somehow solve the problem with the current sequence number:
> 
> Option 1: When compacting, we could compact those data files that index is up to date to one group, those files that index is stale/not exist to another group. (Just like what we are doing with the data file that are unpartitioned/partition spec id not match).
> 
> 
> 
> The pro is that we could still leverage indexes for part of the data files, and we could reuse the sequence number.
> 
> The cons are that the compaction might not reach the target size and we might still have small files.
> 
> Option 2:
> 
> Assume compaction is often triggered by data engineers and the compaction action is not so frequent. We could directly invalid all index files for those compacted. And the user needs to rebuild the index every time after compaction.
> 
> Pro: Easy to implement, clear to understand.
> 
> Cons: Relatively bad user experience. Waste some computing resources to redo some work.
> 
> Option 3:
> 
> We could leverage the engine's computing resource to always rebuild indexes during data compaction.
> 
> Pro: User could leverage index after the data compaction.
> 
> Cons: Rebuilding might take longer time/resources.
> 
> Option 3 alternative: add a configuration property to compaction, control if the user wants to rebuild the index during compaction. 
> 
> 
> 
> Please let me know if you have any thoughts on this.
> 
> Best,
> 
> Zaicheng
> 
> 
> Jack Ye <yezhaoqin@gmail.com <ma...@gmail.com>> 于2022年1月26日周三 13:17写道:
> Thanks for the fast responses!
> 
> Based on the conversations above, it sounds like we have the following consensus:
> 
> 1. asynchronous index creation is preferred, although synchronous index creation is possible.
> 2. a mechanism for tracking file change is needed. Unfortunately sequence number cannot be used due to the introduction of compaction that rewrites files into a lower sequence number. Another monotonically increasing watermark for files has to be introduced for index change detection and invalidation.
> 3. index creation and maintenance procedures should be pluggable by different engines. This should not be an issue because Iceberg has been designing action interfaces for different table maintenance procedures so far, so what Zaicheng describes should be the natural development direction once the work is started.
> 
> Regarding index level, I also think partition level index is more important, but it seems like we have to first do file level as the foundation. This leads to the index storage part. I am not talking about using Parquet to store it, I am asking about what Miao is describing. I don't think we have a consensus around the exact place to store index information yet. My memory is that there are 2 ways:
> 1. file level index stored as a binary field in manifest, partition level index stored as a binary field in manifest list. This would only work for small size indexes like bitmap (or bloom filter to certain extent)
> 2. some sort of binary file to store index data, and index metadata (e.g. index type) and pointer to the binary index data file is kept in 1 (I think this is what Miao is describing)
> 3. some sort of index spec to independently store index metadata and data, similar to what we are proposing today for view
> 
> Another aspect of index storage is the index file location in case of 2 and 3. In the original doc a specific file path structure is proposed, whereas this is a bit against the Iceberg standard of not assuming file path to work with any storage. We also need more clarity in that topic.
> 
> Best,
> Jack Ye
> 
> 
> On Tue, Jan 25, 2022 at 7:02 PM Zaicheng Wang <wcatp19891104@gmail.com <ma...@gmail.com>> wrote:
> Thanks for having the thread. This is Zaicheng from bytedance. 
> 
> Initially we are planning to add index feature for our internal Trino and feel like iceberg could be the best place for holding/buiding the index data.
> We are very interested in having and contributing to this feature. (Pretty new to the community, still having my 2 cents)
> 
> Echo on what Miao mentioned on 4): I feel iceberg could provide interface for creating/updating/deleting index and each engine can decide how to invoke these method (in a distributed manner or single thread manner, in async or sync). 
> Take our use case as an example, we plan to have a new DDL syntax "create index id_1 on table col_1 using bloom"/"update index id_1 on table col_1", and our SQL engine will create distributed index creation/updating operator. Each operator will invoke the index related method provided by iceberg. 
> 
> Storage): Does the index data have to be a file? Wondering if we want to design the index data storage interface in such way that people can plugin different index storage(file storage/centralized index storage service) later on.
> 
> Thanks,
> Zaicheng
> 
> 
> Miao Wang <mi...@adobe.com.invalid> 于2022年1月26日周三 10:22写道:
> Thanks Jack for resuming the discussion. Zaicheng from Byte Dance created a slack channel for index work. I suggested him adding Anton and you to the channel.
> 
>  
> 
> I still remember some conclusions from previous discussions.
> 
>  
> 
> 1). Index types support: We planned to support Skipping Index first. Iceberg metadata exposes hints whether the tracked data files have index which reduces index reading overhead. Index file can be applied when generating the scan task.
> 
>  
> 
> 2). As Ryan mentioned, Sequence number will be used to indicate whether an index is valid. Sequence number can link the data evolution with index evolution.
> 
>  
> 
> 3). Storage: We planned to have simple file format which includes Column Name/ID, Index Type (String), Index content length, and binary content. It is not necessary to use Parquet to store index. Initial thought was 1 data file mapping to 1 index file. It can be merged to 1 partition mapping to 1 index file. As Ryan said, file level implementation could be a step stone for Partition level implementation.
> 
>  
> 
> 4). How to build index: We want to keep the index reading and writing interface with Iceberg and leave the actual building logic as Engine specific (i.e., we can use different compute to build Index without changing anything inside Iceberg).
> 
>  
> 
> Misc:
> 
> Huaxin implemented Index support API for DSv2 in Spark 3.x code base.
> 
> Design doc: https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit <https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit>
> PR should have been merged.
> 
> Guy from IBM did a partial PoC and provided a private doc. I will ask if he can make it public.
> 
>  
> 
> We can continue the discussion and breaking down the big tasks into tickets.
> 
>  
> 
> Thanks!
> 
>  
> 
> Miao
> 
> From: Ryan Blue <blue@tabular.io <ma...@tabular.io>>
> Date: Tuesday, January 25, 2022 at 5:08 PM
> To: Iceberg Dev List <dev@iceberg.apache.org <ma...@iceberg.apache.org>>
> Subject: Re: Continuing the Secondary Index Discussion
> 
> Thanks for raising this for discussion, Jack! It would be great to start adding more indexes.
> 
>  
> 
> > Scope of native index support
> 
>  
> 
> The way I think about it, the biggest challenge here is how to know when you can use an index. For example, if you have a partition index that is up to date as of snapshot 13764091836784, but the current snapshot is 97613097151667, then you basically have no idea what files are covered or not and can't use it. On the other hand, if you know that the index was up to date as of sequence number 11 and you're reading sequence number 12, then you just have to read any data file that was written at sequence number 12.
> 
>  
> 
> The problem of where you can use an index makes me think that it is best to maintain index metadata within Iceberg. An alternative is to try to always keep the index up-to-date, but I don't think that's necessarily possible -- you'd have to support index updates in every writer that touches table data. You would have to spend the time updating indexes at write time, but there are competing priorities like making data available. So I think you want asynchronous index updates and that leads to integration with the table format.
> 
>  
> 
> > Index levels
> 
>  
> 
> I think that partition-level indexes are better for job planning (eliminate whole partitions!) but file-level are still useful for skipping files at the task level. I would probably focus on partition-level, but I'm not strongly opinionated here. File-level is probably a stepping stone to partition-level, given that we would be able to track index data in the same format.
> 
>  
> 
> > Index storage
> 
>  
> 
> Do you mean putting indexes in Parquet, or using Parquet for indexes? I think that bloom filters would probably exceed the amount of data we'd want to put into a Parquet binary column, probably at the file level and almost certainly at the partition level, since the size depends on the number of distinct values and the primary use is for identifiers.
> 
>  
> 
> > Indexing process
> 
>  
> 
> Synchronous is nice, but as I said above, I think we have to support async because it is too complicated to update every writer that touches a table and you may not want to pay the price at write time.
> 
>  
> 
> > Index validation
> 
>  
> 
> I think this is pretty much what I talked about for question 1. I think that we have a good plan around using sequence numbers, if we want to do this.
> 
>  
> 
> Ryan
> 
>  
> 
> On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <yezhaoqin@gmail.com> wrote:
> 
> Hi everyone,
> 
>  
> 
> Based on the conversation in the last community sync and the Iceberg Slack channel, it seems like multiple parties have interest in continuing the effort related to the secondary index in Iceberg, so I would like to restart the thread to continue the discussion.
> 
>  
> 
> So far most people refer to the document authored by Miao Wang <https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit> which has a lot of useful information about the design and implementation. However, the document is also quite old (over a year now) and a lot has changed in Iceberg since then. I think the document leaves the following open topics that we need to continue to address:
> 
>  
> 
> 1. scope of native index support: what type of index should Iceberg support natively, and how should developers allocate effort between adding support for Iceberg native indexes and developing Iceberg support for holistic indexing projects such as HyperSpace <https://microsoft.github.io/hyperspace/>.
> 
>  
> 
> 2. index levels: we have talked about partition level indexing and file level indexing. More clarity is needed on these index levels and on the level of interest and support for each.
> 
>  
> 
> 3. index storage: we had unsettled debates around storing indexes as separate files or embedding them as part of the existing Iceberg file structure. We need to come up with criteria such as index size, ease of generation during writes, etc. to settle the discussion.
> 
>  
> 
> 4. Indexing process: as stated in Miao's document, indexes could be created during the data writing process synchronously, or built asynchronously through an index service. Discussion is needed for the focus of the Iceberg index functionalities.
> 
>  
> 
> 5. index invalidation: depending on the scope and level, certain indexes need to be invalidated during operations like RewriteFiles. Clarity is needed in this domain, including whether we need another sequence number to track such invalidation.
> 
>  
> 
> I suggest we iterate a bit on this list of open questions, and then we can have a meeting to discuss those aspects, and produce an updated document addressing those aspects to provide a clear path forward for developers interested in adding features in this domain.
> 
>  
> 
> Any thoughts?
> 
>  
> 
> Best,
> 
> Jack Ye
> 
>  
> 
> 
> 
>  
> 
> --
> 
> Ryan Blue
> 
> Tabular
> 


Re: [External] Re: Continuing the Secondary Index Discussion

Posted by Zaicheng Wang <wc...@gmail.com>.
Hi PF,

Sure, I have rescheduled the meeting to a CET-friendly time.
The meeting is now scheduled for 9 AM PST, March 11th (6 PM CET, March 11th).
The meeting link is meet.google.com/ttd-jzid-abp
Please feel free to slack me or tag me in the slack channel if anyone would
like to get a meeting invitation (or you could directly join the meeting).

Best,
Zaicheng



Piotr Findeisen <pi...@starburstdata.com> wrote on Mon, Mar 7, 2022 at 21:54:

> Hi Zaicheng,
>
> thanks for following up on this. I'm certainly interested.
> The proposed time doesn't work for me though, I'm in the CET time zone.
>
> Best,
> PF
>
>
> On Sat, Mar 5, 2022 at 9:33 AM Zaicheng Wang <wc...@gmail.com>
> wrote:
>
>> Hi dev folks,
>>
>> As discussed in the sync
>> <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.z3dncl7gr8m1>
>> meeting, we will have a dedicated meeting on this topic.
>> I tentatively scheduled a meeting on 4PM, March 8th PST time. The meeting
>> link is https://meet.google.com/ttd-jzid-abp
>> Please let me know if the time does not work for you.
>>
>> Thanks,
>> Zaicheng
>>
>> zaicheng wang <wa...@bytedance.com> wrote on Wed, Mar 2, 2022 at 21:17:
>>
>>> Hi folks,
>>>
>>> This is Zaicheng from ByteDance. We spent some time working on solving
>>> the index invalidation problem that we discussed in the dev email channel.
>>> While working on the POC, we also realized that some metadata changes
>>> might be introduced.
>>> We put these details into a document:
>>>
>>> https://docs.google.com/document/d/1hLCKNtnA94gFKjssQpqS_2qqAxrlwqq3f6agij_4Rm4/edit?usp=sharing
>>> The document includes two proposals for solving the index invalidation
>>> problem: one based on @Jack Ye’s idea of introducing a new sequence number,
>>> and another that leverages the current manifest entry structure. The
>>> document also describes the corresponding table spec change.
>>> Please let me know if you have any thoughts. We could also discuss this
>>> during the sync meeting.
>>>
>>> Thanks,
>>> Zaicheng
>>>
>>> On Tue, Feb 1, 2022 at 8:51 AM Jack Ye <ye...@gmail.com> wrote:
>>>
>>>> Hi Zaicheng, I cannot see your pictures, maybe we could discuss in
>>>> Slack.
>>>>
>>>> The goal here is to have a monotonically increasing number that could
>>>> be used to detect what files have been newly added and should be indexed.
>>>> This is especially important to know how up-to-date an index is for each
>>>> partition.
>>>>
>>>> In a table without compaction, the sequence numbers of files continue
>>>> to increase. If we have indexed all files up to sequence number 3, we know
>>>> that the next indexing process needs to index all the files with sequence
>>>> number greater than 3. But during compaction, files are rewritten with
>>>> the starting sequence number, and by commit time the table's sequence
>>>> number might already have gone much higher. For example, I start compaction
>>>> at seq=3, and while it runs for a few hours, 10 inserts are done to the
>>>> table, so the current sequence number is 13. When I commit the compacted
>>>> data files, those files are essentially written at a sequence number older
>>>> than the latest. This breaks a lot of assumptions: (1) I cannot
>>>> find new data to index by checking whether the sequence number is higher
>>>> than a certain value, and (2) a reader cannot determine whether an index
>>>> can be used based on the sequence number.
>>>>
>>>> The solution I was describing is to have another watermark that is
>>>> monotonically increasing regardless of compaction. So compaction
>>>> would commit those files at seq=3, but the new watermark of those files
>>>> would be 14. Then we can use this new watermark for all the index operations.
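The watermark scheme described above can be sketched as follows (all names are hypothetical, not actual Iceberg APIs):

```python
# Each file records the table's sequence number at commit time plus a
# separate, monotonically increasing index watermark.

class Table:
    def __init__(self):
        self.seq = 0        # Iceberg sequence number
        self.watermark = 0  # proposed index watermark
        self.files = []     # list of (sequence_number, watermark) pairs

    def insert(self):
        # Normal writes advance both numbers.
        self.seq += 1
        self.watermark += 1
        self.files.append((self.seq, self.watermark))

    def commit_compaction(self, starting_seq):
        # Compaction commits at the sequence number it started from,
        # but still receives a fresh watermark.
        self.watermark += 1
        self.files.append((starting_seq, self.watermark))

    def files_to_index(self, indexed_watermark):
        # The indexer only needs files above the last indexed watermark.
        return [f for f in self.files if f[1] > indexed_watermark]

t = Table()
for _ in range(3):
    t.insert()                     # seq goes 1, 2, 3
starting_seq = t.seq               # compaction starts at seq=3
for _ in range(10):
    t.insert()                     # seq goes 4..13 while compaction runs
t.commit_compaction(starting_seq)  # committed at seq=3, watermark=14

# A check on sequence numbers alone (seq > 13) would miss the compacted
# file, but the watermark catches it.
stale = t.files_to_index(indexed_watermark=13)
```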
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>>
>>>> On Sat, Jan 29, 2022 at 5:07 AM Zaicheng Wang <wc...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Jack,
>>>>>
>>>>>
>>>>> Thanks for the summary; it helps me a lot.
>>>>>
>>>>> I'm trying to understand point 2 and would like to add my 2 cents.
>>>>>
>>>>> *a mechanism for tracking file change is needed. Unfortunately
>>>>> sequence numbers cannot be used due to the introduction of compaction that
>>>>> rewrites files into a lower sequence number. Another monotonically
>>>>> increasing watermark for files has to be introduced for index change
>>>>> detection and invalidation.*
>>>>>
>>>>> Please let me know if I have some wrong/silly assumptions.
>>>>>
>>>>> So the *reason* we couldn't use sequence numbers as the validity
>>>>> indicator of the index is compaction. Before compaction (taking a very
>>>>> simple example), the data file and index file have a mapping, and
>>>>> tableScan.planTask() can decide whether to use the index purely by
>>>>> comparing sequence numbers (as well as the index spec id, if we have one).
>>>>>
>>>>> After compaction, tableScan.planTask() cannot do so, because data
>>>>> file 5 is compacted into a new data file with seq = 10, so wrong plan
>>>>> tasks might be returned.
>>>>>
>>>>> I wonder how an additional watermark only for the index could solve
>>>>> the problem?
>>>>>
>>>>>
>>>>> Based on my gut feeling, I think we could solve the problem with the
>>>>> current sequence number:
>>>>>
>>>>> *Option 1*: When compacting, we could group data files whose index is up
>>>>> to date into one group, and files whose index is stale or missing into
>>>>> another group (just like what we do with data files that are
>>>>> unpartitioned or whose partition spec id does not match).
>>>>>
>>>>> The *pro* is that we could still leverage indexes for part of the
>>>>> data files, and we could reuse the sequence number.
>>>>>
>>>>> The *cons* are that the compaction might not reach the target size
>>>>> and we might still have small files.
>>>>>
>>>>> *Option 2*:
>>>>>
>>>>> Assuming compaction is usually triggered by data engineers and is not
>>>>> so frequent, we could simply invalidate all index files for the
>>>>> compacted data files, and the user would need to rebuild the index
>>>>> after every compaction.
>>>>>
>>>>> *Pro*: Easy to implement, clear to understand.
>>>>>
>>>>> *Cons*: Relatively bad user experience; wastes some computing
>>>>> resources redoing work.
>>>>>
>>>>> *Option 3*:
>>>>>
>>>>> We could leverage the engine's computing resources to always rebuild
>>>>> indexes during data compaction.
>>>>>
>>>>> *Pro*: Users can still leverage the index after data compaction.
>>>>>
>>>>> *Cons*: Rebuilding might take more time/resources.
>>>>>
>>>>> *Option 3 alternative*: add a configuration property to compaction that
>>>>> controls whether the user wants to rebuild the index during compaction.
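Option 1 above can be sketched as follows (hypothetical names, not Iceberg APIs): split compaction candidates by index freshness, so files with an up-to-date index are compacted together and only the stale group needs reindexing.

```python
def group_for_compaction(files, indexed_seq):
    """files: list of (path, sequence_number) pairs;
    indexed_seq: highest sequence number covered by the current index."""
    indexed_group, stale_group = [], []
    for path, seq in files:
        # Files already covered by the index compact together, so the
        # index stays usable for that group after compaction.
        (indexed_group if seq <= indexed_seq else stale_group).append(path)
    return indexed_group, stale_group

files = [("f1.parquet", 2), ("f2.parquet", 3), ("f3.parquet", 7)]
indexed_group, stale_group = group_for_compaction(files, indexed_seq=3)
```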
>>>>>
>>>>>
>>>>> Please let me know if you have any thoughts on this.
>>>>>
>>>>> Best,
>>>>>
>>>>> Zaicheng
>>>>>
>>>>> Jack Ye <ye...@gmail.com> wrote on Wed, Jan 26, 2022 at 13:17:
>>>>>
>>>>>> Thanks for the fast responses!
>>>>>>
>>>>>> Based on the conversations above, it sounds like we have the
>>>>>> following consensus:
>>>>>>
>>>>>> 1. asynchronous index creation is preferred, although synchronous
>>>>>> index creation is possible.
>>>>>> 2. a mechanism for tracking file change is needed. Unfortunately
>>>>>> sequence number cannot be used due to the introduction of compaction that
>>>>>> rewrites files into a lower sequence number. Another monotonically
>>>>>> increasing watermark for files has to be introduced for index change
>>>>>> detection and invalidation.
>>>>>> 3. index creation and maintenance procedures should be pluggable by
>>>>>> different engines. This should not be an issue because Iceberg has been
>>>>>> designing action interfaces for different table maintenance procedures so
>>>>>> far, so what Zaicheng describes should be the natural development direction
>>>>>> once the work is started.
>>>>>>
>>>>>> Regarding index level, I also think partition level index is more
>>>>>> important, but it seems like we have to first do file level as the
>>>>>> foundation. This leads to the index storage part. I am not talking about
>>>>>> using Parquet to store it, I am asking about what Miao is describing. I
>>>>>> don't think we have a consensus around the exact place to store index
>>>>>> information yet. My memory is that there are 3 ways:
>>>>>> 1. file level index stored as a binary field in manifest, partition
>>>>>> level index stored as a binary field in manifest list. This would only work
>>>>>> for small size indexes like bitmap (or bloom filter to certain extent)
>>>>>> 2. some sort of binary file to store index data, with index metadata
>>>>>> (e.g. index type) and a pointer to the binary index data file kept in
>>>>>> the manifest as in option 1 (I think this is what Miao is describing)
>>>>>> 3. some sort of index spec to independently store index metadata and
>>>>>> data, similar to what we are proposing today for view
>>>>>>
>>>>>> Another aspect of index storage is the index file location in the case
>>>>>> of options 2 and 3. In the original doc a specific file path structure
>>>>>> is proposed, but this goes against the Iceberg principle of not assuming
>>>>>> file path structure so that it works with any storage. We also need more
>>>>>> clarity on that topic.
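Storage option 2 above can be sketched as follows (field names are illustrative only, not part of the Iceberg spec): the manifest entry keeps index metadata and a pointer to a separate binary index data file.

```python
from dataclasses import dataclass, field

@dataclass
class IndexRef:
    column_id: int
    index_type: str   # e.g. "bloom", "bitmap"
    index_file: str   # location of the binary index data file
    watermark: int    # how up to date this index is

@dataclass
class ManifestEntry:
    data_file: str
    sequence_number: int
    indexes: list = field(default_factory=list)  # IndexRef entries

# A manifest entry pointing at one bloom-filter index for column 1.
entry = ManifestEntry(
    data_file="s3://bucket/data/00001.parquet",
    sequence_number=3,
    indexes=[IndexRef(1, "bloom", "s3://bucket/index/00001.bin", 14)],
)
```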
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 25, 2022 at 7:02 PM Zaicheng Wang <
>>>>>> wcatp19891104@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for starting the thread. This is Zaicheng from ByteDance.
>>>>>>>
>>>>>>> Initially we planned to add an index feature to our internal Trino,
>>>>>>> and we feel Iceberg could be the best place for holding/building the
>>>>>>> index data. We are very interested in having and contributing to this
>>>>>>> feature. (I'm pretty new to the community, but here are my 2 cents.)
>>>>>>>
>>>>>>> To echo what Miao mentioned in 4): I feel Iceberg could provide
>>>>>>> interfaces for creating/updating/deleting indexes, and each engine can
>>>>>>> decide how to invoke these methods (in a distributed or single-threaded
>>>>>>> manner, async or sync).
>>>>>>> Taking our use case as an example, we plan to add a new DDL syntax
>>>>>>> "create index id_1 on table col_1 using bloom"/"update index id_1 on
>>>>>>> table col_1", and our SQL engine will create distributed index
>>>>>>> creation/updating operators. Each operator will invoke the index-related
>>>>>>> methods provided by Iceberg.
>>>>>>>
>>>>>>> On storage: does the index data have to be a file? I wonder if we
>>>>>>> want to design the index data storage interface in such a way that
>>>>>>> people can plug in different index storage backends (file storage /
>>>>>>> a centralized index storage service) later on.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Zaicheng
>>>>>>>
>>>>>>>
>>>>>>> Miao Wang <mi...@adobe.com.invalid> wrote on Wed, Jan 26, 2022 at 10:22:
>>>>>>>
>>>>>>>> Thanks Jack for resuming the discussion. Zaicheng from ByteDance
>>>>>>>> created a Slack channel for the index work. I suggested he add Anton
>>>>>>>> and you to the channel.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I still remember some conclusions from previous discussions.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 1). Index type support: We planned to support skipping indexes
>>>>>>>> first. Iceberg metadata exposes hints about whether the tracked data
>>>>>>>> files have an index, which reduces index reading overhead. The index
>>>>>>>> file can be applied when generating the scan task.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2). As Ryan mentioned, Sequence number will be used to indicate
>>>>>>>> whether an index is valid. Sequence number can link the data evolution with
>>>>>>>> index evolution.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 3). Storage: We planned to have a simple file format which includes
>>>>>>>> column name/ID, index type (string), index content length, and binary
>>>>>>>> content. It is not necessary to use Parquet to store the index. The
>>>>>>>> initial thought was 1 data file mapping to 1 index file; this can be
>>>>>>>> merged into 1 partition mapping to 1 index file. As Ryan said, a
>>>>>>>> file-level implementation could be a stepping stone for a
>>>>>>>> partition-level implementation.
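The simple index file layout described above can be sketched as follows (the exact byte encoding here is illustrative only, not a proposed spec):

```python
# One index entry: column id, index type string, content length, then the
# binary index content, in big-endian framing.
import struct

def write_index_entry(column_id: int, index_type: str, content: bytes) -> bytes:
    type_bytes = index_type.encode("utf-8")
    # layout: column_id (4B) | type len (2B) | type | content len (4B) | content
    return (struct.pack(">iH", column_id, len(type_bytes))
            + type_bytes
            + struct.pack(">i", len(content))
            + content)

def read_index_entry(buf: bytes):
    column_id, type_len = struct.unpack_from(">iH", buf, 0)
    off = 6
    index_type = buf[off:off + type_len].decode("utf-8")
    off += type_len
    (content_len,) = struct.unpack_from(">i", buf, off)
    off += 4
    content = buf[off:off + content_len]
    return column_id, index_type, content

entry = write_index_entry(1, "bloom", b"\x01\x02\x03")
```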
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 4). How to build the index: We want to keep the index reading and
>>>>>>>> writing interfaces within Iceberg and leave the actual building logic
>>>>>>>> engine-specific (i.e., we can use different compute engines to build
>>>>>>>> the index without changing anything inside Iceberg).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Misc:
>>>>>>>>
>>>>>>>> Huaxin implemented an index support API for DSv2 in the Spark 3.x
>>>>>>>> code base.
>>>>>>>>
>>>>>>>> Design doc:
>>>>>>>> https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit
>>>>>>>>
>>>>>>>> PR should have been merged.
>>>>>>>>
>>>>>>>> Guy from IBM did a partial PoC and provided a private doc. I will
>>>>>>>> ask if he can make it public.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> We can continue the discussion and break down the big tasks into
>>>>>>>> tickets.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Miao
>>>>>>>>
>>>>>>>> *From: *Ryan Blue <bl...@tabular.io>
>>>>>>>> *Date: *Tuesday, January 25, 2022 at 5:08 PM
>>>>>>>> *To: *Iceberg Dev List <de...@iceberg.apache.org>
>>>>>>>> *Subject: *Re: Continuing the Secondary Index Discussion
>>>>>>>>
>>>>>>>> Thanks for raising this for discussion, Jack! It would be great to
>>>>>>>> start adding more indexes.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > Scope of native index support
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The way I think about it, the biggest challenge here is how to know
>>>>>>>> when you can use an index. For example, if you have a partition index that
>>>>>>>> is up to date as of snapshot 13764091836784, but the current snapshot is
>>>>>>>> 97613097151667, then you basically have no idea what files are covered or
>>>>>>>> not and can't use it. On the other hand, if you know that the index was up
>>>>>>>> to date as of sequence number 11 and you're reading sequence number 12,
>>>>>>>> then you just have to read any data file that was written at sequence
>>>>>>>> number 12.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The problem of where you can use an index makes me think that it is
>>>>>>>> best to maintain index metadata within Iceberg. An alternative is to try to
>>>>>>>> always keep the index up-to-date, but I don't think that's necessarily
>>>>>>>> possible -- you'd have to support index updates in every writer that
>>>>>>>> touches table data. You would have to spend the time updating indexes at
>>>>>>>> write time, but there are competing priorities like making data available.
>>>>>>>> So I think you want asynchronous index updates and that leads to
>>>>>>>> integration with the table format.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > Index levels
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I think that partition-level indexes are better for job planning
>>>>>>>> (eliminate whole partitions!) but file-level are still useful for skipping
>>>>>>>> files at the task level. I would probably focus on partition-level, but I'm
>>>>>>>> not strongly opinionated here. File-level is probably a stepping stone to
>>>>>>>> partition-level, given that we would be able to track index data in the
>>>>>>>> same format.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > Index storage
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Do you mean putting indexes in Parquet, or using Parquet for
>>>>>>>> indexes? I think that bloom filters would probably exceed the amount of
>>>>>>>> data we'd want to put into a Parquet binary column, probably at the file
>>>>>>>> level and almost certainly at the partition level, since the size depends
>>>>>>>> on the number of distinct values and the primary use is for identifiers.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > Indexing process
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Synchronous is nice, but as I said above, I think we have to
>>>>>>>> support async because it is too complicated to update every writer that
>>>>>>>> touches a table and you may not want to pay the price at write time.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > Index validation
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I think this is pretty much what I talked about for question 1. I
>>>>>>>> think that we have a good plan around using sequence numbers, if we want to
>>>>>>>> do this.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <ye...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Based on the conversation in the last community sync and the
>>>>>>>> Iceberg Slack channel, it seems like multiple parties have interest in
>>>>>>>> continuing the effort related to the secondary index in Iceberg, so I would
>>>>>>>> like to restart the thread to continue the discussion.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> So far most people refer to the document authored by Miao Wang
>>>>>>>> <https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit>
>>>>>>>> which has a lot of useful information about the design and
>>>>>>>> implementation. However, the document is also quite old (over a year
>>>>>>>> now) and a lot has changed in Iceberg since then. I think the document
>>>>>>>> leaves the following open topics that we need to continue to address:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 1. *scope of native index support*: what type of index should
>>>>>>>> Iceberg support natively, and how should developers allocate effort
>>>>>>>> between adding support for Iceberg native indexes and developing
>>>>>>>> Iceberg support for holistic indexing projects such as HyperSpace
>>>>>>>> <https://microsoft.github.io/hyperspace/>.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2. *index levels*: we have talked about partition level indexing
>>>>>>>> and file level indexing. More clarity is needed on these index levels
>>>>>>>> and on the level of interest and support for each.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 3. *index storage*: we had unsettled debates around storing indexes
>>>>>>>> as separate files or embedding them as part of the existing Iceberg
>>>>>>>> file structure. We need to come up with criteria such as index size,
>>>>>>>> ease of generation during writes, etc. to settle the discussion.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 4. *Indexing process*: as stated in Miao's document, indexes could
>>>>>>>> be created during the data writing process synchronously, or built
>>>>>>>> asynchronously through an index service. Discussion is needed for the focus
>>>>>>>> of the Iceberg index functionalities.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 5. *index invalidation*: depending on the scope and level, certain
>>>>>>>> indexes need to be invalidated during operations like RewriteFiles.
>>>>>>>> Clarity is needed in this domain, including whether we need another
>>>>>>>> sequence number to track such invalidation.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I suggest we iterate a bit on this list of open questions, and then
>>>>>>>> we can have a meeting to discuss those aspects, and produce an updated
>>>>>>>> document addressing those aspects to provide a clear path forward for
>>>>>>>> developers interested in adding features in this domain.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Any thoughts?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Jack Ye
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Ryan Blue
>>>>>>>>
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>

Re: [External] Re: Continuing the Secondary Index Discussion

Posted by Piotr Findeisen <pi...@starburstdata.com>.
Hi Zaicheng,

thanks for following up on this. I'm certainly interested.
The proposed time doesn't work for me though, I'm in the CET time zone.

Best,
PF


On Sat, Mar 5, 2022 at 9:33 AM Zaicheng Wang <wc...@gmail.com>
wrote:

> Hi dev folks,
>
> As discussed in the sync
> <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.z3dncl7gr8m1>
> meeting, we will have a dedicated meeting on this topic.
> I tentatively scheduled a meeting on 4PM, March 8th PST time. The meeting
> link is https://meet.google.com/ttd-jzid-abp
> Please let me know if the time does not work for you.
>
> Thanks,
> Zaicheng
>
> zaicheng wang <wa...@bytedance.com> wrote on Wed, Mar 2, 2022 at 21:17:
>
>> Hi folks,
>>
>> This is Zaicheng from ByteDance. We spent some time working on solving
>> the index invalidation problem that we discussed in the dev email channel.
>> While working on the POC, we also realized that some metadata changes
>> might be introduced.
>> We put these details into a document:
>>
>> https://docs.google.com/document/d/1hLCKNtnA94gFKjssQpqS_2qqAxrlwqq3f6agij_4Rm4/edit?usp=sharing
>> The document includes two proposals for solving the index invalidation
>> problem: one based on @Jack Ye’s idea of introducing a new sequence number,
>> and another that leverages the current manifest entry structure. The
>> document also describes the corresponding table spec change.
>> Please let me know if you have any thoughts. We could also discuss this
>> during the sync meeting.
>>
>> Thanks,
>> Zaicheng
>>
>> On Tue, Feb 1, 2022 at 8:51 AM Jack Ye <ye...@gmail.com> wrote:
>>
>>> Hi Zaicheng, I cannot see your pictures, maybe we could discuss in Slack.
>>>
>>> The goal here is to have a monotonically increasing number that could be
>>> used to detect what files have been newly added and should be indexed. This
>>> is especially important to know how up-to-date an index is for each
>>> partition.
>>>
>>> In a table without compaction, the sequence numbers of files continue
>>> to increase. If we have indexed all files up to sequence number 3, we know
>>> that the next indexing process needs to index all the files with sequence
>>> number greater than 3. But during compaction, files are rewritten with
>>> the starting sequence number, and by commit time the table's sequence
>>> number might already have gone much higher. For example, I start compaction
>>> at seq=3, and while it runs for a few hours, 10 inserts are done to the
>>> table, so the current sequence number is 13. When I commit the compacted
>>> data files, those files are essentially written at a sequence number older
>>> than the latest. This breaks a lot of assumptions: (1) I cannot
>>> find new data to index by checking whether the sequence number is higher
>>> than a certain value, and (2) a reader cannot determine whether an index
>>> can be used based on the sequence number.
>>>
>>> The solution I was describing is to have another watermark that is
>>> monotonically increasing regardless of compaction. So compaction
>>> would commit those files at seq=3, but the new watermark of those files
>>> would be 14. Then we can use this new watermark for all the index operations.
>>>
>>> Best,
>>> Jack Ye
>>>
>>>
>>> On Sat, Jan 29, 2022 at 5:07 AM Zaicheng Wang <wc...@gmail.com>
>>> wrote:
>>>
>>>> Hi Jack,
>>>>
>>>>
>>>> Thanks for the summary; it helps me a lot.
>>>>
>>>> I'm trying to understand point 2 and would like to add my 2 cents.
>>>>
>>>> *a mechanism for tracking file change is needed. Unfortunately sequence
>>>> numbers cannot be used due to the introduction of compaction that rewrites
>>>> files into a lower sequence number. Another monotonically increasing
>>>> watermark for files has to be introduced for index change detection and
>>>> invalidation.*
>>>>
>>>> Please let me know if I have some wrong/silly assumptions.
>>>>
>>>> So the *reason* we couldn't use sequence numbers as the validity
>>>> indicator of the index is compaction. Before compaction (taking a very
>>>> simple example), the data file and index file have a mapping, and
>>>> tableScan.planTask() can decide whether to use the index purely by
>>>> comparing sequence numbers (as well as the index spec id, if we have one).
>>>>
>>>> After compaction, tableScan.planTask() cannot do so, because data
>>>> file 5 is compacted into a new data file with seq = 10, so wrong plan
>>>> tasks might be returned.
>>>>
>>>> I wonder how an additional watermark only for the index could solve the
>>>> problem?
>>>>
>>>>
>>>> Based on my gut feeling, I think we could solve the problem with the
>>>> current sequence number:
>>>>
>>>> *Option 1*: When compacting, we could group data files whose index is up
>>>> to date into one group, and files whose index is stale or missing into
>>>> another group (just like what we do with data files that are
>>>> unpartitioned or whose partition spec id does not match).
>>>>
>>>> The *pro* is that we could still leverage indexes for part of the data
>>>> files, and we could reuse the sequence number.
>>>>
>>>> The *cons* are that the compaction might not reach the target size and
>>>> we might still have small files.
>>>>
>>>> *Option 2*:
>>>>
>>>> Assuming compaction is usually triggered by data engineers and is not
>>>> so frequent, we could simply invalidate all index files for the
>>>> compacted data files, and the user would need to rebuild the index
>>>> after every compaction.
>>>>
>>>> *Pro*: Easy to implement, clear to understand.
>>>>
>>>> *Cons*: Relatively bad user experience; wastes some computing
>>>> resources redoing work.
>>>>
>>>> *Option 3*:
>>>>
>>>> We could leverage the engine's computing resources to always rebuild
>>>> indexes during data compaction.
>>>>
>>>> *Pro*: Users can still leverage the index after data compaction.
>>>>
>>>> *Cons*: Rebuilding might take more time/resources.
>>>>
>>>> *Option 3 alternative*: add a configuration property to compaction that
>>>> controls whether the user wants to rebuild the index during compaction.
>>>>
>>>>
>>>> Please let me know if you have any thoughts on this.
>>>>
>>>> Best,
>>>>
>>>> Zaicheng
>>>>
>>>> Jack Ye <ye...@gmail.com> wrote on Wed, Jan 26, 2022 at 13:17:
>>>>
>>>>> Thanks for the fast responses!
>>>>>
>>>>> Based on the conversations above, it sounds like we have the following
>>>>> consensus:
>>>>>
>>>>> 1. asynchronous index creation is preferred, although synchronous
>>>>> index creation is possible.
>>>>> 2. a mechanism for tracking file change is needed. Unfortunately
>>>>> sequence number cannot be used due to the introduction of compaction that
>>>>> rewrites files into a lower sequence number. Another monotonically
>>>>> increasing watermark for files has to be introduced for index change
>>>>> detection and invalidation.
>>>>> 3. index creation and maintenance procedures should be pluggable by
>>>>> different engines. This should not be an issue because Iceberg has been
>>>>> designing action interfaces for different table maintenance procedures so
>>>>> far, so what Zaicheng describes should be the natural development direction
>>>>> once the work is started.
>>>>>
>>>>> Regarding index level, I also think partition level index is more
>>>>> important, but it seems like we have to first do file level as the
>>>>> foundation. This leads to the index storage part. I am not talking about
>>>>> using Parquet to store it, I am asking about what Miao is describing. I
>>>>> don't think we have a consensus around the exact place to store index
>>>>> information yet. My memory is that there are 3 ways:
>>>>> 1. file level index stored as a binary field in manifest, partition
>>>>> level index stored as a binary field in manifest list. This would only work
>>>>> for small size indexes like bitmap (or bloom filter to certain extent)
>>>>> 2. some sort of binary file to store index data, with index metadata
>>>>> (e.g. index type) and a pointer to the binary index data file kept in
>>>>> the manifest as in option 1 (I think this is what Miao is describing)
>>>>> 3. some sort of index spec to independently store index metadata and
>>>>> data, similar to what we are proposing today for view
>>>>>
>>>>> Another aspect of index storage is the index file location in case of
>>>>> 2 and 3. In the original doc a specific file path structure is proposed,
>>>>> which is a bit against the Iceberg convention of not assuming file path
>>>>> structures so that any storage can be supported. We also need more clarity
>>>>> on that topic.
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>>
>>>>> On Tue, Jan 25, 2022 at 7:02 PM Zaicheng Wang <wc...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks for having the thread. This is Zaicheng from bytedance.
>>>>>>
>>>>>> Initially we were planning to add an index feature for our internal Trino,
>>>>>> and we feel like Iceberg could be the best place for holding/building the
>>>>>> index data.
>>>>>> We are very interested in having and contributing to this feature.
>>>>>> (Pretty new to the community, but here are my 2 cents.)
>>>>>>
>>>>>> Echoing what Miao mentioned in 4): I feel Iceberg could provide
>>>>>> interfaces for creating/updating/deleting indexes, and each engine can
>>>>>> decide how to invoke these methods (in a distributed or single-threaded
>>>>>> manner, async or sync).
>>>>>> Take our use case as an example, we plan to have a new DDL syntax
>>>>>> "create index id_1 on table col_1 using bloom"/"update index id_1 on table
>>>>>> col_1", and our SQL engine will create distributed index creation/updating
>>>>>> operator. Each operator will invoke the index related method provided by
>>>>>> iceberg.
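What such an engine-agnostic interface could look like, very roughly. This is a Python stand-in for illustration; the `IndexWriter` name and its methods are hypothetical, not an existing Iceberg API:

```python
from abc import ABC, abstractmethod

class IndexWriter(ABC):
    """Hypothetical per-file index builder; the engine drives invocation,
    deciding distribution (single-threaded vs distributed) and sync vs async."""

    @abstractmethod
    def add(self, value) -> None:
        """Feed one column value into the index."""

    @abstractmethod
    def finish(self) -> bytes:
        """Serialize the finished index as an opaque blob."""

class ToyBloomIndexWriter(IndexWriter):
    """Toy stand-in: a sorted set instead of a real bloom filter."""

    def __init__(self):
        self.values = set()

    def add(self, value):
        self.values.add(value)

    def finish(self):
        return ",".join(sorted(self.values)).encode()

# An engine operator (e.g. one per data file) would do something like:
writer = ToyBloomIndexWriter()
for value in ["id2", "id1"]:
    writer.add(value)
index_blob = writer.finish()
print(index_blob)  # b'id1,id2'
```

The point of the sketch is the split of responsibilities: Iceberg defines the writer contract and owns the resulting blob, while the DDL ("create index ... using bloom") and the distributed execution stay entirely on the engine side.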
>>>>>>
>>>>>> Storage): Does the index data have to be a file? Wondering if we want
>>>>>> to design the index data storage interface in such a way that people can
>>>>>> plug in different index storage (file storage / centralized index storage
>>>>>> service) later on.
>>>>>>
>>>>>> Thanks,
>>>>>> Zaicheng
>>>>>>
>>>>>>
>>>>>> Miao Wang <mi...@adobe.com.invalid> 于2022年1月26日周三 10:22写道:
>>>>>>
>>>>>>> Thanks Jack for resuming the discussion. Zaicheng from Byte Dance
>>>>>>> created a Slack channel for index work. I suggested that he add Anton and
>>>>>>> you to the channel.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I still remember some conclusions from previous discussions.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 1). Index types support: We planned to support Skipping Index first.
>>>>>>> Iceberg metadata exposes hints about whether the tracked data files have
>>>>>>> an index, which reduces index reading overhead. The index file can be
>>>>>>> applied when generating the scan task.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2). As Ryan mentioned, Sequence number will be used to indicate
>>>>>>> whether an index is valid. Sequence number can link the data evolution with
>>>>>>> index evolution.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 3). Storage: We planned to have a simple file format which includes the
>>>>>>> Column Name/ID, Index Type (String), Index content length, and binary
>>>>>>> content. It is not necessary to use Parquet to store the index. The initial
>>>>>>> thought was 1 data file mapping to 1 index file. It can be merged into 1
>>>>>>> partition mapping to 1 index file. As Ryan said, a file level implementation
>>>>>>> could be a stepping stone for a partition level implementation.
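The simple format described above could be laid out like this. A sketch only; the field order, length prefixes, and integer widths here are assumptions, not a settled spec:

```python
import struct

def write_index_entry(column_name: str, index_type: str, content: bytes) -> bytes:
    """Pack one index entry: column name, index type, content length, content.
    Length-prefixed fields keep the blob self-describing and easy to skip."""
    name = column_name.encode()
    itype = index_type.encode()
    return (
        struct.pack("<H", len(name)) + name
        + struct.pack("<H", len(itype)) + itype
        + struct.pack("<I", len(content)) + content
    )

def read_index_entry(buf: bytes):
    """Inverse of write_index_entry."""
    off = 0
    (n,) = struct.unpack_from("<H", buf, off); off += 2
    name = buf[off:off + n].decode(); off += n
    (t,) = struct.unpack_from("<H", buf, off); off += 2
    itype = buf[off:off + t].decode(); off += t
    (length,) = struct.unpack_from("<I", buf, off); off += 4
    content = buf[off:off + length]
    return name, itype, content

entry = write_index_entry("user_id", "bloom", b"\x01\x02\x03")
print(read_index_entry(entry))  # ('user_id', 'bloom', b'\x01\x02\x03')
```

A real spec would also need a version marker and probably a column ID rather than a name, but the shape of the round trip is the same.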
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 4). How to build the index: We want to keep the index reading and
>>>>>>> writing interface within Iceberg and leave the actual building logic
>>>>>>> engine-specific (i.e., we can use different compute to build the index
>>>>>>> without changing anything inside Iceberg).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Misc:
>>>>>>>
>>>>>>> Huaxin implemented Index support API for DSv2 in Spark 3.x code
>>>>>>> base.
>>>>>>>
>>>>>>> Design doc:
>>>>>>> https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit
>>>>>>>
>>>>>>> PR should have been merged.
>>>>>>>
>>>>>>> Guy from IBM did a partial PoC and provided a private doc. I will
>>>>>>> ask if he can make it public.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> We can continue the discussion and break down the big tasks into
>>>>>>> tickets.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Miao
>>>>>>>
>>>>>>> *From: *Ryan Blue <bl...@tabular.io>
>>>>>>> *Date: *Tuesday, January 25, 2022 at 5:08 PM
>>>>>>> *To: *Iceberg Dev List <de...@iceberg.apache.org>
>>>>>>> *Subject: *Re: Continuing the Secondary Index Discussion
>>>>>>>
>>>>>>> Thanks for raising this for discussion, Jack! It would be great to
>>>>>>> start adding more indexes.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> > Scope of native index support
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The way I think about it, the biggest challenge here is how to know
>>>>>>> when you can use an index. For example, if you have a partition index that
>>>>>>> is up to date as of snapshot 13764091836784, but the current snapshot is
>>>>>>> 97613097151667, then you basically have no idea what files are covered or
>>>>>>> not and can't use it. On the other hand, if you know that the index was up
>>>>>>> to date as of sequence number 11 and you're reading sequence number 12,
>>>>>>> then you just have to read any data file that was written at sequence
>>>>>>> number 12.
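Ryan's sequence number example can be made concrete: the reader answers what it can from the index and scans only the files the index does not yet cover. A toy sketch, with an illustrative tuple layout rather than real Iceberg planning types:

```python
def split_scan(files, index_seq, read_seq):
    """Partition files into index-answerable vs must-scan sets.

    files: list of (path, sequence_number) pairs. The index covers every
    file written at or before index_seq; files written after it, up to the
    read snapshot's sequence number, must be scanned directly.
    """
    use_index = [p for p, s in files if s <= index_seq]
    scan = [p for p, s in files if index_seq < s <= read_seq]
    return use_index, scan

# Index is current as of sequence number 11; we read at sequence number 12.
files = [("f1", 10), ("f2", 11), ("f3", 12)]
use_index, scan = split_scan(files, index_seq=11, read_seq=12)
print(use_index, scan)  # ['f1', 'f2'] ['f3']
```

With snapshot IDs instead of sequence numbers, no such ordering comparison exists, which is exactly why the index cannot be used in the snapshot example above.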
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The problem of where you can use an index makes me think that it is
>>>>>>> best to maintain index metadata within Iceberg. An alternative is to try to
>>>>>>> always keep the index up-to-date, but I don't think that's necessarily
>>>>>>> possible -- you'd have to support index updates in every writer that
>>>>>>> touches table data. You would have to spend the time updating indexes at
>>>>>>> write time, but there are competing priorities like making data available.
>>>>>>> So I think you want asynchronous index updates and that leads to
>>>>>>> integration with the table format.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> > Index levels
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I think that partition-level indexes are better for job planning
>>>>>>> (eliminate whole partitions!) but file-level are still useful for skipping
>>>>>>> files at the task level. I would probably focus on partition-level, but I'm
>>>>>>> not strongly opinionated here. File-level is probably a stepping stone to
>>>>>>> partition-level, given that we would be able to track index data in the
>>>>>>> same format.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> > Index storage
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Do you mean putting indexes in Parquet, or using Parquet for
>>>>>>> indexes? I think that bloom filters would probably exceed the amount of
>>>>>>> data we'd want to put into a Parquet binary column, probably at the file
>>>>>>> level and almost certainly at the partition level, since the size depends
>>>>>>> on the number of distinct values and the primary use is for identifiers.
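The size concern is easy to quantify with the standard bloom filter sizing formula m = -n * ln(p) / (ln 2)^2, as a back-of-envelope check:

```python
import math

def bloom_size_bytes(n_distinct: int, fpp: float) -> int:
    """Bits needed for an optimally-sized bloom filter, rounded up to bytes."""
    bits = -n_distinct * math.log(fpp) / (math.log(2) ** 2)
    return math.ceil(bits / 8)

# 100M distinct identifiers at a 1% false positive rate is already ~114 MiB,
# far more than fits comfortably in a manifest binary column.
size = bloom_size_bytes(100_000_000, 0.01)
print(size // (1024 * 1024), "MiB")  # 114 MiB
```

So at partition level, with identifier-like columns, a separate index file (options 2 and 3 above) seems hard to avoid for bloom filters.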
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> > Indexing process
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Synchronous is nice, but as I said above, I think we have to support
>>>>>>> async because it is too complicated to update every writer that touches a
>>>>>>> table and you may not want to pay the price at write time.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> > Index validation
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I think this is pretty much what I talked about for question 1. I
>>>>>>> think that we have a good plan around using sequence numbers, if we want to
>>>>>>> do this.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <ye...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Based on the conversation in the last community sync and the Iceberg
>>>>>>> Slack channel, it seems like multiple parties have interest in continuing
>>>>>>> the effort related to the secondary index in Iceberg, so I would like to
>>>>>>> restart the thread to continue the discussion.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> So far most people refer to the document authored by Miao Wang
>>>>>>> <https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit>
>>>>>>> which has a lot of useful information about the design and implementation.
>>>>>>> However, the document is also quite old (over a year now) and a lot has
>>>>>>> changed in Iceberg since then. I think the document leaves the following
>>>>>>> open topics that we need to continue to address:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 1. *scope of native index support*: what type of index should
>>>>>>> Iceberg support natively, how should developers allocate effort between
>>>>>>> adding support of Iceberg native index compared to developing Iceberg
>>>>>>> support for holistic indexing projects such as HyperSpace
>>>>>>> <https://microsoft.github.io/hyperspace/>
>>>>>>> .
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2. *index levels*: we have talked about partition level indexing
>>>>>>> and file level indexing. More clarity is needed for these index levels and
>>>>>>> the level of interest and support needed for those different indexing
>>>>>>> levels.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 3. *index storage*: we had unsettled debates around making index
>>>>>>> separated files or embedding it as a part of existing Iceberg file
>>>>>>> structure. We need to come up with certain criteria such as index size,
>>>>>>> easiness to generate during write, etc. to settle the discussion.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 4. *Indexing process*: as stated in Miao's document, indexes could
>>>>>>> be created during the data writing process synchronously, or built
>>>>>>> asynchronously through an index service. Discussion is needed for the focus
>>>>>>> of the Iceberg index functionalities.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 5. *index invalidation*: depends on the scope and level, certain
>>>>>>> indexes need to be invalidated during operations like RewriteFiles. Clarity
>>>>>>> is needed in this domain, including if we need another sequence number to
>>>>>>> track such invalidation.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I suggest we iterate a bit on this list of open questions, and then
>>>>>>> we can have a meeting to discuss those aspects, and produce an updated
>>>>>>> document addressing those aspects to provide a clear path forward for
>>>>>>> developers interested in adding features in this domain.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Any thoughts?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Jack Ye
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Ryan Blue
>>>>>>>
>>>>>>> Tabular
>>>>>>>
>>>>>>

Re: [External] Re: Continuing the Secondary Index Discussion

Posted by Zaicheng Wang <wc...@gmail.com>.
Hi dev folks,

As discussed in the sync
<https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.z3dncl7gr8m1>
meeting, we will have a dedicated meeting on this topic.
I tentatively scheduled a meeting for 4 PM PST on March 8th. The meeting
link is https://meet.google.com/ttd-jzid-abp
Please let me know if the time does not work for you.

Thanks,
Zaicheng

zaicheng wang <wa...@bytedance.com> 于2022年3月2日周三 21:17写道:

> Hi folks,
>
> This is Zaicheng from bytedance. We spent some time working on solving the
> index invalidation problem as we discussed in the dev email channel.
> While working on the PoC, we also realized that some metadata changes
> might be introduced.
> We put these details into a document:
>
> https://docs.google.com/document/d/1hLCKNtnA94gFKjssQpqS_2qqAxrlwqq3f6agij_4Rm4/edit?usp=sharing
> The document includes two proposals for solving the index invalidation
> problem: one based on @Jack Ye’s idea of introducing a new sequence number,
> and another that leverages the current manifest entry structure. The
> document will also describe the corresponding table spec change.
> Please let me know if you have any thoughts. We could also discuss this
> during the sync meeting.
>
> Thanks,
> Zaicheng
>
> On Tue, Feb 1, 2022 at 8:51 AM Jack Ye <ye...@gmail.com> wrote:
>
>> Hi Zaicheng, I cannot see your pictures, maybe we could discuss in Slack.
>>
>> The goal here is to have a monotonically increasing number that could be
>> used to detect what files have been newly added and should be indexed. This
>> is especially important to know how up-to-date an index is for each
>> partition.
>>
>> In a table without compaction, sequence number of files would continue to
>> increase. If we have indexed all files up to sequence number 3, we know
>> that the next indexing process needs to index all the files with sequence
>> number greater than 3. But during compaction, files will be rewritten with
>> the starting sequence number. By commit time the sequence number might have
>> already gone much higher. For example, I start compaction at seq=3, and
>> while this is running for a few hours, there are 10 inserts done to the
>> table, and the current sequence number is 13. When I commit the compacted
>> data files, those files are essentially written to a sequence number older
>> than the latest. This breaks a lot of assumptions, like: (1) I cannot just
>> find new data to index by calculating if the sequence number is higher than
>> certain value, (2) a reader cannot determine if an index could be used
>> based on the sequence number.
>>
>> The solution I was describing is to have another watermark that is
>> monotonically increasing regardless of compaction or not. So Compaction
>> would commit those files at seq=3, but the new watermark of those files is
>> 14. Then we can use this new watermark for all the index operations.
>>
>> Best,
>> Jack Ye
>>
>>
>> On Sat, Jan 29, 2022 at 5:07 AM Zaicheng Wang <wc...@gmail.com>
>> wrote:
>>
>>> Hi Jack,
>>>
>>>
>>> Thanks for the summary and it helps me a lot.
>>>
>>> Trying to understand point 2 and having my 2 cents.
>>>
>>> *a mechanism for tracking file changes is needed. Unfortunately sequence
>>> numbers cannot be used due to the introduction of compaction that rewrites
>>> files into a lower sequence number. Another monotonically increasing
>>> watermark for files has to be introduced for index change detection and
>>> invalidation.*
>>>
>>> Please let me know if I have some wrong/silly assumptions.
>>>
>>> So the *reason* we couldn't use sequence numbers as the validity
>>> indicator of the index is compaction. Before compaction (taking a very
>>> simple example), the data file and index file should have a mapping and the
>>> tableScan.planTask() is able to decide whether to use index purely by
>>> comparing sequence numbers (as well as index spec id, if we have one).
>>>
>>> After compaction, tableScan.planTask() couldn't do so, because a data
>>> file with seq = 5 is compacted into a new data file with seq = 10. Thus
>>> wrong plan tasks might be returned.
>>>
>>> I wonder how an additional watermark only for the index could solve the
>>> problem?
>>>
>>>
>>> And based on my gut feeling, I feel we could somehow solve the problem
>>> with the current sequence number:
>>>
>>> *Option 1*: When compacting, we could compact the data files whose
>>> index is up to date into one group, and the files whose index is stale or
>>> missing into another group. (Just like what we are doing with data files
>>> that are unpartitioned or whose partition spec id does not match.)
>>>
>>> The *pro* is that we could still leverage indexes for part of the data
>>> files, and we could reuse the sequence number.
>>>
>>> The *cons* are that the compaction might not reach the target size and
>>> we might still have small files.
>>>
>>> *Option 2*:
>>>
>>> Assume compaction is often triggered by data engineers and the
>>> compaction action is not so frequent. We could directly invalidate all
>>> index files for the compacted data files, and the user would need to
>>> rebuild the index after every compaction.
>>>
>>> *Pro*: Easy to implement, clear to understand.
>>>
>>> *Cons*: Relatively bad user experience. Wastes some computing resources
>>> redoing work.
>>>
>>> *Option 3*:
>>>
>>> We could leverage the engine's computing resources to always rebuild
>>> indexes during data compaction.
>>>
>>> *Pro*: User could leverage index after the data compaction.
>>>
>>> *Cons*: Rebuilding might take more time and resources.
>>>
>>> *Option 3 alternative*: add a configuration property to compaction that
>>> controls whether the user wants to rebuild the index during compaction.
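Option 1 essentially adds an index-freshness key to compaction planning. A rough sketch in toy Python (real Iceberg planning lives in Java, and `indexed_seq` is an illustrative parameter meaning "the index covers files up to this sequence number"):

```python
from itertools import groupby

def plan_compaction_groups(files, indexed_seq):
    """Split compaction candidates so indexed and unindexed files never mix.

    files: list of (path, sequence_number) pairs. A file counts as covered
    by the index iff its sequence number is <= indexed_seq. Keeping the two
    sets in separate rewrite groups means a rewrite never mixes indexed data
    into an unindexed output file (and vice versa).
    """
    covered = lambda f: f[1] <= indexed_seq
    keyed = sorted(files, key=covered)  # False (unindexed) sorts first
    return {
        ("indexed" if is_covered else "unindexed"): [f[0] for f in group]
        for is_covered, group in groupby(keyed, key=covered)
    }

groups = plan_compaction_groups([("a", 1), ("b", 5), ("c", 2)], indexed_seq=3)
print(groups)  # {'unindexed': ['b'], 'indexed': ['a', 'c']}
```

This mirrors how current rewrite planning already separates files by partition spec id, at the cost, as noted above, of possibly undersized output groups.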
>>>
>>>
>>> Please let me know if you have any thoughts on this.
>>>
>>> Best,
>>>
>>> Zaicheng
>>>
>>> Jack Ye <ye...@gmail.com> 于2022年1月26日周三 13:17写道:
>>>
>>>> Thanks for the fast responses!
>>>>
>>>> Based on the conversations above, it sounds like we have the following
>>>> consensus:
>>>>
>>>> 1. asynchronous index creation is preferred, although synchronous index
>>>> creation is possible.
>>>> 2. a mechanism for tracking file changes is needed. Unfortunately,
>>>> sequence numbers cannot be used due to the introduction of compaction that
>>>> rewrites files into a lower sequence number. Another monotonically
>>>> increasing watermark for files has to be introduced for index change
>>>> detection and invalidation.
>>>> 3. index creation and maintenance procedures should be pluggable by
>>>> different engines. This should not be an issue because Iceberg has been
>>>> designing action interfaces for different table maintenance procedures so
>>>> far, so what Zaicheng describes should be the natural development direction
>>>> once the work is started.
>>>>
>>>> Regarding index level, I also think partition level index is more
>>>> important, but it seems like we have to first do file level as the
>>>> foundation. This leads to the index storage part. I am not talking about
>>>> using Parquet to store it; I am asking about what Miao is describing. I
>>>> don't think we have a consensus around the exact place to store index
>>>> information yet. My memory is that there are 3 ways:
>>>> 1. file level index stored as a binary field in manifest, partition
>>>> level index stored as a binary field in manifest list. This would only work
>>>> for small size indexes like bitmap (or bloom filter to certain extent)
>>>> 2. some sort of binary file to store index data, with the index metadata
>>>> (e.g. index type) and a pointer to the binary index data file kept in the
>>>> manifest as in 1 (I think this is what Miao is describing)
>>>> 3. some sort of index spec to independently store index metadata and
>>>> data, similar to what we are proposing today for view
>>>>
>>>> Another aspect of index storage is the index file location in case of 2
>>>> and 3. In the original doc a specific file path structure is proposed,
>>>> which is a bit against the Iceberg convention of not assuming file path
>>>> structures so that any storage can be supported. We also need more clarity
>>>> on that topic.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>>
>>>> On Tue, Jan 25, 2022 at 7:02 PM Zaicheng Wang <wc...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for having the thread. This is Zaicheng from bytedance.
>>>>>
>>>>> Initially we were planning to add an index feature for our internal Trino,
>>>>> and we feel like Iceberg could be the best place for holding/building the
>>>>> index data.
>>>>> We are very interested in having and contributing to this feature.
>>>>> (Pretty new to the community, but here are my 2 cents.)
>>>>>
>>>>> Echoing what Miao mentioned in 4): I feel Iceberg could provide
>>>>> interfaces for creating/updating/deleting indexes, and each engine can
>>>>> decide how to invoke these methods (in a distributed or single-threaded
>>>>> manner, async or sync).
>>>>> Take our use case as an example, we plan to have a new DDL syntax
>>>>> "create index id_1 on table col_1 using bloom"/"update index id_1 on table
>>>>> col_1", and our SQL engine will create distributed index creation/updating
>>>>> operator. Each operator will invoke the index related method provided by
>>>>> iceberg.
>>>>>
>>>>> Storage): Does the index data have to be a file? Wondering if we want
>>>>> to design the index data storage interface in such a way that people can
>>>>> plug in different index storage (file storage / centralized index storage
>>>>> service) later on.
>>>>>
>>>>> Thanks,
>>>>> Zaicheng
>>>>>
>>>>>
>>>>> Miao Wang <mi...@adobe.com.invalid> 于2022年1月26日周三 10:22写道:
>>>>>
>>>>>> Thanks Jack for resuming the discussion. Zaicheng from Byte Dance
>>>>>> created a Slack channel for index work. I suggested that he add Anton and
>>>>>> you to the channel.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I still remember some conclusions from previous discussions.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 1). Index types support: We planned to support Skipping Index first.
>>>>>> Iceberg metadata exposes hints about whether the tracked data files have
>>>>>> an index, which reduces index reading overhead. The index file can be
>>>>>> applied when generating the scan task.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2). As Ryan mentioned, Sequence number will be used to indicate
>>>>>> whether an index is valid. Sequence number can link the data evolution with
>>>>>> index evolution.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 3). Storage: We planned to have a simple file format which includes the
>>>>>> Column Name/ID, Index Type (String), Index content length, and binary
>>>>>> content. It is not necessary to use Parquet to store the index. The initial
>>>>>> thought was 1 data file mapping to 1 index file. It can be merged into 1
>>>>>> partition mapping to 1 index file. As Ryan said, a file level
>>>>>> implementation could be a stepping stone for a partition level implementation.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 4). How to build the index: We want to keep the index reading and writing
>>>>>> interface within Iceberg and leave the actual building logic
>>>>>> engine-specific (i.e., we can use different compute to build the index
>>>>>> without changing anything inside Iceberg).
>>>>>>
>>>>>>
>>>>>>
>>>>>> Misc:
>>>>>>
>>>>>> Huaxin implemented Index support API for DSv2 in Spark 3.x code base.
>>>>>>
>>>>>> Design doc:
>>>>>> https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit
>>>>>>
>>>>>> PR should have been merged.
>>>>>>
>>>>>> Guy from IBM did a partial PoC and provided a private doc. I will ask
>>>>>> if he can make it public.
>>>>>>
>>>>>>
>>>>>>
>>>>>> We can continue the discussion and break down the big tasks into
>>>>>> tickets.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>> Miao
>>>>>>
>>>>>> *From: *Ryan Blue <bl...@tabular.io>
>>>>>> *Date: *Tuesday, January 25, 2022 at 5:08 PM
>>>>>> *To: *Iceberg Dev List <de...@iceberg.apache.org>
>>>>>> *Subject: *Re: Continuing the Secondary Index Discussion
>>>>>>
>>>>>> Thanks for raising this for discussion, Jack! It would be great to
>>>>>> start adding more indexes.
>>>>>>
>>>>>>
>>>>>>
>>>>>> > Scope of native index support
>>>>>>
>>>>>>
>>>>>>
>>>>>> The way I think about it, the biggest challenge here is how to know
>>>>>> when you can use an index. For example, if you have a partition index that
>>>>>> is up to date as of snapshot 13764091836784, but the current snapshot is
>>>>>> 97613097151667, then you basically have no idea what files are covered or
>>>>>> not and can't use it. On the other hand, if you know that the index was up
>>>>>> to date as of sequence number 11 and you're reading sequence number 12,
>>>>>> then you just have to read any data file that was written at sequence
>>>>>> number 12.
>>>>>>
>>>>>>
>>>>>>
>>>>>> The problem of where you can use an index makes me think that it is
>>>>>> best to maintain index metadata within Iceberg. An alternative is to try to
>>>>>> always keep the index up-to-date, but I don't think that's necessarily
>>>>>> possible -- you'd have to support index updates in every writer that
>>>>>> touches table data. You would have to spend the time updating indexes at
>>>>>> write time, but there are competing priorities like making data available.
>>>>>> So I think you want asynchronous index updates and that leads to
>>>>>> integration with the table format.
>>>>>>
>>>>>>
>>>>>>
>>>>>> > Index levels
>>>>>>
>>>>>>
>>>>>>
>>>>>> I think that partition-level indexes are better for job planning
>>>>>> (eliminate whole partitions!) but file-level are still useful for skipping
>>>>>> files at the task level. I would probably focus on partition-level, but I'm
>>>>>> not strongly opinionated here. File-level is probably a stepping stone to
>>>>>> partition-level, given that we would be able to track index data in the
>>>>>> same format.
>>>>>>
>>>>>>
>>>>>>
>>>>>> > Index storage
>>>>>>
>>>>>>
>>>>>>
>>>>>> Do you mean putting indexes in Parquet, or using Parquet for indexes?
>>>>>> I think that bloom filters would probably exceed the amount of data we'd
>>>>>> want to put into a Parquet binary column, probably at the file level and
>>>>>> almost certainly at the partition level, since the size depends on the
>>>>>> number of distinct values and the primary use is for identifiers.
>>>>>>
>>>>>>
>>>>>>
>>>>>> > Indexing process
>>>>>>
>>>>>>
>>>>>>
>>>>>> Synchronous is nice, but as I said above, I think we have to support
>>>>>> async because it is too complicated to update every writer that touches a
>>>>>> table and you may not want to pay the price at write time.
>>>>>>
>>>>>>
>>>>>>
>>>>>> > Index validation
>>>>>>
>>>>>>
>>>>>>
>>>>>> I think this is pretty much what I talked about for question 1. I
>>>>>> think that we have a good plan around using sequence numbers, if we want to
>>>>>> do this.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <ye...@gmail.com> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>>
>>>>>>
>>>>>> Based on the conversation in the last community sync and the Iceberg
>>>>>> Slack channel, it seems like multiple parties have interest in continuing
>>>>>> the effort related to the secondary index in Iceberg, so I would like to
>>>>>> restart the thread to continue the discussion.
>>>>>>
>>>>>>
>>>>>>
>>>>>> So far most people refer to the document authored by Miao Wang
>>>>>> <https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit>
>>>>>> which has a lot of useful information about the design and implementation.
>>>>>> However, the document is also quite old (over a year now) and a lot has
>>>>>> changed in Iceberg since then. I think the document leaves the following
>>>>>> open topics that we need to continue to address:
>>>>>>
>>>>>>
>>>>>>
>>>>>> 1. *scope of native index support*: what type of index should
>>>>>> Iceberg support natively, how should developers allocate effort between
>>>>>> adding support of Iceberg native index compared to developing Iceberg
>>>>>> support for holistic indexing projects such as HyperSpace
>>>>>> <https://microsoft.github.io/hyperspace/>
>>>>>> .
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2. *index levels*: we have talked about partition level indexing and
>>>>>> file level indexing. More clarity is needed for these index levels and the
>>>>>> level of interest and support needed for those different indexing levels.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 3. *index storage*: we had unsettled debates around making index
>>>>>> separated files or embedding it as a part of existing Iceberg file
>>>>>> structure. We need to come up with certain criteria such as index size,
>>>>>> easiness to generate during write, etc. to settle the discussion.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 4. *Indexing process*: as stated in Miao's document, indexes could
>>>>>> be created during the data writing process synchronously, or built
>>>>>> asynchronously through an index service. Discussion is needed for the focus
>>>>>> of the Iceberg index functionalities.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 5. *index invalidation*: depends on the scope and level, certain
>>>>>> indexes need to be invalidated during operations like RewriteFiles. Clarity
>>>>>> is needed in this domain, including if we need another sequence number to
>>>>>> track such invalidation.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I suggest we iterate a bit on this list of open questions, and then
>>>>>> we can have a meeting to discuss those aspects, and produce an updated
>>>>>> document addressing those aspects to provide a clear path forward for
>>>>>> developers interested in adding features in this domain.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jack Ye
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Ryan Blue
>>>>>>
>>>>>> Tabular
>>>>>>
>>>>>

Re: [External] Re: Continuing the Secondary Index Discussion

Posted by zaicheng wang <wa...@bytedance.com>.
Hi folks,

This is Zaicheng from bytedance. We spent some time working on solving the
index invalidation problem as we discussed in the dev email channel.
While working on the PoC, we also realized that some metadata changes might
be introduced.
We put these details into a document:
https://docs.google.com/document/d/1hLCKNtnA94gFKjssQpqS_2qqAxrlwqq3f6agij_4Rm4/edit?usp=sharing
The document includes two proposals for solving the index invalidation
problem: one based on @Jack Ye’s idea of introducing a new sequence number,
and another that leverages the current manifest entry structure. The
document will also describe the corresponding table spec change.
Please let me know if you have any thoughts. We could also discuss this
during the sync meeting.

Thanks,
Zaicheng

On Tue, Feb 1, 2022 at 8:51 AM Jack Ye <ye...@gmail.com> wrote:

> Hi Zaicheng, I cannot see your pictures, maybe we could discuss in Slack.
>
> The goal here is to have a monotonically increasing number that could be
> used to detect what files have been newly added and should be indexed. This
> is especially important to know how up-to-date an index is for each
> partition.
>
> In a table without compaction, sequence number of files would continue to
> increase. If we have indexed all files up to sequence number 3, we know
> that the next indexing process needs to index all the files with sequence
> number greater than 3. But during compaction, files will be rewritten with
> the starting sequence number. By commit time the sequence number might have
> already gone much higher. For example, I start compaction at seq=3, and
> while this is running for a few hours, there are 10 inserts done to the
> table, and the current sequence number is 13. When I commit the compacted
> data files, those files are essentially written to a sequence number older
> than the latest. This breaks a lot of assumptions, like: (1) I cannot just
> find new data to index by calculating if the sequence number is higher than
> certain value, (2) a reader cannot determine if an index could be used
> based on the sequence number.
>
> The solution I was describing is to have another watermark that is
> monotonically increasing regardless of compaction. So compaction
> would commit those files at seq=3, but the new watermark of those files
> is 14. Then we can use this new watermark for all the index operations.
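A minimal sketch of the distinction, with hypothetical names (`index_watermark` is not an existing Iceberg field, just an illustration of the proposed second number):

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    sequence_number: int  # compaction commits at the *starting* sequence number
    index_watermark: int  # hypothetical field: always the latest value at commit

def files_to_index(files, indexed_up_to):
    # Detect new data by the watermark, which stays monotonic across compaction.
    return [f for f in files if f.index_watermark > indexed_up_to]

# The example above: compaction started at seq=3, the table reached seq=13
# meanwhile, so the compacted file commits at seq=3 but gets watermark 14.
files = [
    DataFile("compacted.parquet", sequence_number=3, index_watermark=14),
    DataFile("insert-10.parquet", sequence_number=13, index_watermark=13),
]

# With the index built up to watermark 13, only the compacted file is new:
assert [f.path for f in files_to_index(files, indexed_up_to=13)] == ["compacted.parquet"]
# A check on sequence_number alone (seq > 13) would wrongly skip it.
```

The same watermark comparison also answers the reader-side question of whether an index is still usable.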
>
> Best,
> Jack Ye
>
>
> On Sat, Jan 29, 2022 at 5:07 AM Zaicheng Wang <wc...@gmail.com>
> wrote:
>
>> Hi Jack,
>>
>>
>> Thanks for the summary; it helps me a lot.
>>
>> Trying to understand point 2, and adding my 2 cents.
>>
>> *a mechanism for tracking file change is needed. Unfortunately sequence
>> numbers cannot be used due to the introduction of compaction that rewrites
>> files into a lower sequence number. Another monotonically increasing
>> watermark for files has to be introduced for index change detection and
>> invalidation.*
>>
>> Please let me know if I have some wrong/silly assumptions.
>>
>> So the *reason* we couldn't use sequence numbers as the validity
>> indicator of the index is compaction. Before compaction (taking a very
>> simple example), the data file and index file should have a mapping, and
>> tableScan.planTask() is able to decide whether to use an index purely by
>> comparing sequence numbers (as well as the index spec id, if we have one).
>>
>> After compaction, tableScan.planTask() couldn't do so because data
>> file 5 is compacted into a new data file with seq = 10. Thus, wrong plan
>> tasks might be returned.
>>
>> I wonder how an additional watermark only for the index could solve the
>> problem?
>>
>>
>> And based on my gut feeling, I feel we could somehow solve the problem
>> with the current sequence number:
>>
>> *Option 1*: When compacting, we could compact those data files whose
>> index is up to date into one group, and those files whose index is stale or
>> nonexistent into another group. (Just like what we do with data files that
>> are unpartitioned or whose partition spec id does not match.)
>>
>> The *pro* is that we could still leverage indexes for part of the data
>> files, and we could reuse the sequence number.
>>
>> The *cons* are that the compaction might not reach the target size and
>> we might still have small files.
>>
>> *Option 2*:
>>
>> Assume compaction is usually triggered by data engineers and is not very
>> frequent. We could directly invalidate all index files for the compacted
>> data files, and the user needs to rebuild the index every time after
>> compaction.
>>
>> *Pro*: Easy to implement, clear to understand.
>>
>> *Cons*: Relatively bad user experience. Wastes some computing resources
>> redoing work.
>>
>> *Option 3*:
>>
>> We could leverage the engine's computing resources to always rebuild
>> indexes during data compaction.
>>
>> *Pro*: Users can leverage the index right after data compaction.
>>
>> *Cons*: Rebuilding might take more time/resources.
>>
>> *Option 3 alternative*: add a configuration property to compaction that
>> controls whether the user wants to rebuild the index during compaction.
>>
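Option 1's grouping step could be sketched roughly as follows (`index_is_fresh` is a hypothetical predicate, not an Iceberg API):

```python
def group_for_compaction(data_files, index_is_fresh):
    # Option 1: compact files whose index is up to date together, and stale
    # or unindexed files together, so each output group has a uniform index
    # state and valid index mappings are not mixed with invalid ones.
    fresh, stale = [], []
    for f in data_files:
        (fresh if index_is_fresh(f) else stale).append(f)
    return fresh, stale

files = ["a.parquet", "b.parquet", "c.parquet"]
fresh, stale = group_for_compaction(files, lambda f: f != "c.parquet")
assert fresh == ["a.parquet", "b.parquet"] and stale == ["c.parquet"]
```

As noted above, the trade-off is that either group alone may fall short of the target file size.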
>>
>> Please let me know if you have any thoughts on this.
>>
>> Best,
>>
>> Zaicheng
>>
>> On Wed, Jan 26, 2022 at 13:17, Jack Ye <ye...@gmail.com> wrote:
>>
>>> Thanks for the fast responses!
>>>
>>> Based on the conversations above, it sounds like we have the following
>>> consensus:
>>>
>>> 1. asynchronous index creation is preferred, although synchronous index
>>> creation is possible.
>>> 2. a mechanism for tracking file changes is needed. Unfortunately,
>>> sequence numbers cannot be used due to the introduction of compaction that
>>> rewrites files into a lower sequence number. Another monotonically
>>> increasing watermark for files has to be introduced for index change
>>> detection and invalidation.
>>> 3. index creation and maintenance procedures should be pluggable by
>>> different engines. This should not be an issue because Iceberg has been
>>> designing action interfaces for different table maintenance procedures so
>>> far, so what Zaicheng describes should be the natural development direction
>>> once the work is started.
>>>
>>> Regarding index level, I also think partition level index is more
>>> important, but it seems like we have to first do file level as the
>>> foundation. This leads to the index storage part. I am not talking about
>>> using Parquet to store it, I am asking about what Miao is describing. I
>>> don't think we have a consensus around the exact place to store index
>>> information yet. My memory is that there are 3 ways:
>>> 1. file level index stored as a binary field in the manifest, partition
>>> level index stored as a binary field in the manifest list. This would only
>>> work for small indexes like a bitmap (or a bloom filter, to a certain extent)
>>> 2. some sort of binary file to store index data, with index metadata
>>> (e.g. index type) and a pointer to the binary index data file kept as in 1 (I
>>> think this is what Miao is describing)
>>> 3. some sort of index spec to independently store index metadata and
>>> data, similar to what we are proposing today for views
>>>
>>> Another aspect of index storage is the index file location, in cases 2
>>> and 3. In the original doc a specific file path structure is proposed,
>>> which goes a bit against the Iceberg principle of not assuming file paths
>>> so that the format works with any storage. We also need more clarity on that topic.
>>>
>>> Best,
>>> Jack Ye
>>>
>>>
>>> On Tue, Jan 25, 2022 at 7:02 PM Zaicheng Wang <wc...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for having the thread. This is Zaicheng from ByteDance.
>>>>
>>>> Initially we were planning to add an index feature for our internal Trino,
>>>> and we feel like Iceberg could be the best place for holding/building the
>>>> index data.
>>>> We are very interested in having and contributing to this feature.
>>>> (Pretty new to the community, but here are my 2 cents.)
>>>>
>>>> Echo on what Miao mentioned in 4): I feel Iceberg could provide
>>>> interfaces for creating/updating/deleting indexes, and each engine can
>>>> decide how to invoke these methods (in a distributed or single-threaded
>>>> manner, async or sync).
>>>> Take our use case as an example: we plan to have new DDL syntax
>>>> "create index id_1 on table col_1 using bloom" / "update index id_1 on table
>>>> col_1", and our SQL engine will create distributed index creation/updating
>>>> operators. Each operator will invoke the index-related methods provided by
>>>> Iceberg.
>>>>
>>>> Storage): Does the index data have to be a file? Wondering if we want
>>>> to design the index data storage interface in such a way that people can
>>>> plug in different index storage (file storage / a centralized index storage
>>>> service) later on.
>>>>
>>>> Thanks,
>>>> Zaicheng
>>>>
>>>>
>>>> On Wed, Jan 26, 2022 at 10:22, Miao Wang <mi...@adobe.com.invalid> wrote:
>>>>
>>>>> Thanks Jack for resuming the discussion. Zaicheng from ByteDance
>>>>> created a Slack channel for the index work. I suggested he add Anton and
>>>>> you to the channel.
>>>>>
>>>>>
>>>>>
>>>>> I still remember some conclusions from previous discussions.
>>>>>
>>>>>
>>>>>
>>>>> 1). Index types support: We planned to support a skipping index first.
>>>>> Iceberg metadata exposes hints about whether the tracked data files have
>>>>> an index, which reduces index reading overhead. The index file can be
>>>>> applied when generating the scan task.
>>>>>
>>>>>
>>>>>
>>>>> 2). As Ryan mentioned, the sequence number will be used to indicate
>>>>> whether an index is valid. The sequence number can link data evolution
>>>>> with index evolution.
>>>>>
>>>>>
>>>>>
>>>>> 3). Storage: We planned to have a simple file format which includes
>>>>> Column Name/ID, Index Type (String), index content length, and binary
>>>>> content. It is not necessary to use Parquet to store the index. The
>>>>> initial thought was 1 data file mapping to 1 index file; this can be
>>>>> merged into 1 partition mapping to 1 index file. As Ryan said, a file-level
>>>>> implementation could be a stepping stone for a partition-level implementation.
>>>>>
>>>>>
>>>>>
>>>>> 4). How to build the index: We want to keep the index reading and
>>>>> writing interfaces within Iceberg and leave the actual building logic as
>>>>> engine-specific (i.e., we can use different compute engines to build an
>>>>> index without changing anything inside Iceberg).
>>>>>
>>>>>
>>>>>
>>>>> Misc:
>>>>>
>>>>> Huaxin implemented the index support API for DSv2 in the Spark 3.x code base.
>>>>>
>>>>> Design doc:
>>>>> https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit
>>>>>
>>>>> PR should have been merged.
>>>>>
>>>>> Guy from IBM did a partial PoC and provided a private doc. I will ask
>>>>> if he can make it public.
>>>>>
>>>>>
>>>>>
>>>>> We can continue the discussion and break the big tasks down into
>>>>> tickets.
>>>>>
>>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>> Miao
>>>>>
>>>>> *From: *Ryan Blue <bl...@tabular.io>
>>>>> *Date: *Tuesday, January 25, 2022 at 5:08 PM
>>>>> *To: *Iceberg Dev List <de...@iceberg.apache.org>
>>>>> *Subject: *Re: Continuing the Secondary Index Discussion
>>>>>
>>>>> Thanks for raising this for discussion, Jack! It would be great to
>>>>> start adding more indexes.
>>>>>
>>>>>
>>>>>
>>>>> > Scope of native index support
>>>>>
>>>>>
>>>>>
>>>>> The way I think about it, the biggest challenge here is how to know
>>>>> when you can use an index. For example, if you have a partition index that
>>>>> is up to date as of snapshot 13764091836784, but the current snapshot is
>>>>> 97613097151667, then you basically have no idea what files are covered or
>>>>> not and can't use it. On the other hand, if you know that the index was up
>>>>> to date as of sequence number 11 and you're reading sequence number 12,
>>>>> then you just have to read any data file that was written at sequence
>>>>> number 12.
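As a sketch of that planning-time split (the names are hypothetical, not the Iceberg scan-planning API):

```python
def plan_with_index(files_with_seq, index_covered_up_to):
    # Files at or below the index's sequence number can be pruned via the
    # index; anything written at a higher sequence number must be read
    # directly, since the index knows nothing about it.
    prunable = [f for f, seq in files_with_seq if seq <= index_covered_up_to]
    must_read = [f for f, seq in files_with_seq if seq > index_covered_up_to]
    return prunable, must_read

# Index up to date as of sequence number 11, reading at sequence number 12:
files = [("old-1.parquet", 10), ("old-2.parquet", 11), ("new.parquet", 12)]
prunable, must_read = plan_with_index(files, index_covered_up_to=11)
assert must_read == ["new.parquet"]  # written at seq 12, not covered by the index
```

This only works because, absent compaction, sequence numbers behave monotonically; the snapshot-id version of the same check cannot bound what the index covers.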
>>>>>
>>>>>
>>>>>
>>>>> The problem of where you can use an index makes me think that it is
>>>>> best to maintain index metadata within Iceberg. An alternative is to try to
>>>>> always keep the index up-to-date, but I don't think that's necessarily
>>>>> possible -- you'd have to support index updates in every writer that
>>>>> touches table data. You would have to spend the time updating indexes at
>>>>> write time, but there are competing priorities like making data available.
>>>>> So I think you want asynchronous index updates and that leads to
>>>>> integration with the table format.
>>>>>
>>>>>
>>>>>
>>>>> > Index levels
>>>>>
>>>>>
>>>>>
>>>>> I think that partition-level indexes are better for job planning
>>>>> (eliminate whole partitions!) but file-level are still useful for skipping
>>>>> files at the task level. I would probably focus on partition-level, but I'm
>>>>> not strongly opinionated here. File-level is probably a stepping stone to
>>>>> partition-level, given that we would be able to track index data in the
>>>>> same format.
>>>>>
>>>>>
>>>>>
>>>>> > Index storage
>>>>>
>>>>>
>>>>>
>>>>> Do you mean putting indexes in Parquet, or using Parquet for indexes?
>>>>> I think that bloom filters would probably exceed the amount of data we'd
>>>>> want to put into a Parquet binary column, probably at the file level and
>>>>> almost certainly at the partition level, since the size depends on the
>>>>> number of distinct values and the primary use is for identifiers.
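For a rough sense of scale, standard bloom filter sizing math (not Iceberg-specific) shows how quickly these filters grow with the number of distinct values:

```python
import math

def bloom_filter_size_bytes(n_distinct, fpp):
    # Optimal bloom filter size: m = -n * ln(p) / (ln 2)^2 bits.
    bits = -n_distinct * math.log(fpp) / (math.log(2) ** 2)
    return bits / 8

# ~1.1 MiB for a file with 1M distinct identifiers at a 1% false positive
# rate, and proportionally more at the partition level -- likely too large
# to inline as a binary column value in a manifest.
print(round(bloom_filter_size_bytes(1_000_000, 0.01) / (1024 * 1024), 2))  # prints 1.14
```

The size scales linearly in the distinct-value count, which is why identifier columns, the primary use case, are the worst case for embedding.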
>>>>>
>>>>>
>>>>>
>>>>> > Indexing process
>>>>>
>>>>>
>>>>>
>>>>> Synchronous is nice, but as I said above, I think we have to support
>>>>> async because it is too complicated to update every writer that touches a
>>>>> table and you may not want to pay the price at write time.
>>>>>
>>>>>
>>>>>
>>>>> > Index validation
>>>>>
>>>>>
>>>>>
>>>>> I think this is pretty much what I talked about for question 1. I
>>>>> think that we have a good plan around using sequence numbers, if we want to
>>>>> do this.
>>>>>
>>>>>
>>>>>
>>>>> Ryan
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <ye...@gmail.com> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>>
>>>>>
>>>>> Based on the conversation in the last community sync and the Iceberg
>>>>> Slack channel, it seems like multiple parties have interest in continuing
>>>>> the effort related to the secondary index in Iceberg, so I would like to
>>>>> restart the thread to continue the discussion.
>>>>>
>>>>>
>>>>>
>>>>> So far most people refer to the document authored by Miao Wang
>>>>> <https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit>
>>>>> which has a lot of useful information about the design and implementation.
>>>>> However, the document is also quite old (over a year now) and a lot has
>>>>> changed in Iceberg since then. I think the document leaves the following
>>>>> open topics that we need to continue to address:
>>>>>
>>>>>
>>>>>
>>>>> 1. *scope of native index support*: what type of index should Iceberg
>>>>> support natively, and how should developers allocate effort between
>>>>> Iceberg-native indexes and Iceberg support for holistic indexing projects
>>>>> such as HyperSpace
>>>>> <https://microsoft.github.io/hyperspace/>
>>>>> .
>>>>>
>>>>>
>>>>>
>>>>> 2. *index levels*: we have talked about partition-level indexing and
>>>>> file-level indexing. More clarity is needed on these levels, and on the
>>>>> interest in and support needed for each.
>>>>>
>>>>>
>>>>>
>>>>> 3. *index storage*: we had unsettled debates around making indexes
>>>>> separate files or embedding them as part of the existing Iceberg file
>>>>> structure. We need to come up with criteria such as index size, ease of
>>>>> generation during write, etc. to settle the discussion.
>>>>>
>>>>>
>>>>>
>>>>> 4. *Indexing process*: as stated in Miao's document, indexes could be
>>>>> created synchronously during the data writing process, or built
>>>>> asynchronously through an index service. Discussion is needed on the focus
>>>>> of the Iceberg index functionalities.
>>>>>
>>>>>
>>>>>
>>>>> 5. *index invalidation*: depending on the scope and level, certain
>>>>> indexes need to be invalidated during operations like RewriteFiles. Clarity
>>>>> is needed in this domain, including whether we need another sequence number
>>>>> to track such invalidation.
>>>>>
>>>>>
>>>>>
>>>>> I suggest we iterate a bit on this list of open questions, then have a
>>>>> meeting to discuss them, and produce an updated document addressing these
>>>>> aspects to provide a clear path forward for developers interested in
>>>>> adding features in this domain.
>>>>>
>>>>>
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Jack Ye
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Ryan Blue
>>>>>
>>>>> Tabular
>>>>>
>>>>