Posted to dev@iceberg.apache.org by Anton Okolnychyi <ao...@apple.com.INVALID> on 2022/03/07 21:30:14 UTC

Re: Change Data Capture for Iceberg

Hey folks, 

Based on Yufei’s design doc and what we discussed during the sync, I shared my thoughts on what can be efficiently supported right now.

https://github.com/apache/iceberg/issues/3941#issuecomment-1061153554

I’d be interested to learn more about specific use cases that would violate the assumptions I listed in my comment. If you have such a use case in mind, please comment on the issue.

- Anton


> On 24 Feb 2022, at 14:57, Yufei Gu <fl...@gmail.com> wrote:
> 
> Hi everyone,
> 
> We are moving the CDC design discussion to next Friday (Mar 4), 9-10am PST due to an unexpected event. The meeting link will be the same: meet.google.com/vam-cmfx-feo. Thanks!
> 
> Best,
> 
> Yufei
> 
> 
> On Tue, Feb 22, 2022 at 12:25 PM Yufei Gu <flyrain000@gmail.com> wrote:
> Hi everyone,
> 
> It's great to see a lot of interest in the design.
> We are planning to have a meeting to discuss the Iceberg CDC design on Friday (2/25), 9-10am PST. The meeting link is meet.google.com/vam-cmfx-feo. We will talk about the general idea, as well as open questions. The meeting will be recorded.
> Best,
> Yufei
> 
> 
> On Fri, Feb 11, 2022 at 3:54 PM Holden Karau <holden@pigscanfly.ca> wrote:
> Oh cool, I have not had a chance to review much of this, but I was having a conversation with a team that wanted similar features for a table, so I'm excited to see folks working on it 👍
> 
> On Fri, Feb 11, 2022 at 12:40 PM Yufei Gu <flyrain000@gmail.com> wrote:
> Hi team,
> 
> We propose a way to generate CDC records from Iceberg tables. It is an approach that requires no table spec change and no write-time logging. It will cover the majority of CDC use cases, though not necessarily all of them. We believe it's a good starting point for CDC in Iceberg. Any feedback is welcome!
> https://docs.google.com/document/d/1bN6rdLNcYOHnT3xVBfB33BoiPO06aKBo56SZmuU9pnY/edit?usp=sharing
> 
> Best,
> 
> Yufei
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: Change Data Capture for Iceberg

Posted by Yufei Gu <fl...@gmail.com>.
I have to change the meeting to next Monday (May 2) due to a conflict. Sorry
about that.

Change Data Capture for Iceberg
Monday, May 2 · 9:00 – 10:00am
Google Meet joining info
Video call link: https://meet.google.com/pjv-cspg-xos

Best,

Yufei

`This is not a contribution`



Re: Change Data Capture for Iceberg

Posted by Yufei Gu <fl...@gmail.com>.
Hi everyone,

Here is the Change Data Capture update. I posted a draft PR
(https://github.com/apache/iceberg/pull/4539) two weeks ago and got lots of
reviews. Thank you all for the reviews. Based on the feedback, we will move
forward with the approach and file separate formal PRs. We are also
planning to have a meeting to share the general idea of the approach and the
next steps. Looking forward to seeing you there. Here is the meeting info.

Change Data Capture for Iceberg
Friday, April 29 · 9:00 – 10:00am
Google Meet joining info
Video call link: https://meet.google.com/pjv-cspg-xos

Best,

Yufei

`This is not a contribution`



Re: Change Data Capture for Iceberg

Posted by Yufei Gu <fl...@gmail.com>.
Synced up with Anton and Russell on the CDC design and implementation.
Here are the changes to how we get deleted rows in the MVP.

We will leverage the `_deleted` metadata column for both position deletes and
equality deletes. This removes limitations of the original design. In
particular, instead of emitting equality deletes directly as CDC deleted rows,
we resolve the equality deletes to the actual deleted rows and emit those as
CDC delete rows. For example, if an equality delete removes two data rows, we
will emit the two actual deleted rows.
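
To make this concrete, here is a minimal PySpark sketch of the idea. The
example data, column names, and the "_change_type" label are illustrative
only; this is not the Iceberg implementation or its API.

# A minimal sketch (illustrative data and columns) of resolving an equality
# delete to the concrete rows it removes before emitting them as CDC deletes.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data rows that existed before the incoming snapshot.
data = spark.createDataFrame(
    [(1, "Amy", 20), (1, "Amy", 21), (2, "Bob", 30)], ["id", "name", "age"])

# A hypothetical equality delete on id = 1; it matches two data rows.
eq_deletes = spark.createDataFrame([(1,)], ["id"])

# Resolve the equality delete to the actual rows it removes (semi join on the
# delete's equality field), then tag them as CDC delete records.
cdc_deletes = (data
    .join(eq_deletes, on="id", how="left_semi")
    .withColumn("_change_type", F.lit("DELETE")))

cdc_deletes.show()  # emits the two concrete deleted rows, not just "id = 1"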

We changed the design so that we emit all deleted rows (position and equality)
together in the same format. This is simpler and more efficient than the
original design:
1. We don't have to output identifier fields.
2. Downstream tables can write CDC deleted rows directly as equality deletes
without using "merge".
3. It is easier to reconstruct updates in phase 2.

The downside is that it is expensive for certain use cases. For example, it
has to scan all data files to resolve global equality deletes. We can try to
address this in the future by providing an option to emit equality delete
rows directly. Please refer to
https://github.com/apache/iceberg/issues/3941#issuecomment-1081273709 for
more details.

Let us know if you have any feedback. Thanks.

Yufei



Re: Change Data Capture for Iceberg

Posted by Yufei Gu <fl...@gmail.com>.
Hi everyone,


Thanks for joining the discussion in the sync-up last Friday. We’ve
reached consensus on several items:

   1. Snapshot-granularity CDC generation is useful and will cover a wide
      range of use cases. Sub-snapshot granularity is out of scope at this
      moment and needs a separate proposal.
   2. For COW, we should treat all rows from the deleted data files as the
      deleted rows, which is more efficient and, more importantly, doesn’t
      yield wrong results when duplicate rows exist.
   3. Create a minimum viable product (MVP) according to the current design.


Thanks Anton for the comments in
https://github.com/apache/iceberg/issues/3941#issuecomment-1061153554.


Based on the meetup and Anton's comment, here is the plan to move forward. We
split the implementation into two phases. The minimum viable product (MVP)
in phase 1 will keep most things from the proposal, with the following
adjustments.


*Phase 1 (MVP)*

   1. Emit delete and insert CDC records only.
   2. Don’t join for equality deletes. Emit equality deletes directly as
      deleted rows, per Anton’s suggestion. Otherwise, we would need to join
      the whole table with the equality delete files, which is not scalable.
      We will evaluate the cost of the join in phase 2 and either support it
      or find another way to approach it.
   3. COW: output all rows in the deleted data files as the deleted rows, and
      output all rows in the added data files as the inserted rows. We will
      figure out a more scalable way to filter out unchanged rows in phase 2.
      The approach of joining on all columns has two issues (see the sketch
      after this list):
      1. It is not scalable; think about a table with more than 100 columns.
      2. It cannot handle duplicate records, e.g. (1, Amy, 20) was in the
         data files marked as deleted, then we got new data files with two
         identical rows (1, Amy, 20) and (1, Amy, 20).
   4. User interface: create an action to generate CDC records instead of a
      procedure. An action can return a dataframe, which is more convenient
      than an array of InternalRow produced by a Spark procedure.
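
Here is a minimal PySpark sketch of the duplicate-row pitfall from item 3
above. The data and column names are made up for illustration; this is not
the Iceberg implementation.

# Why joining on all columns to filter out "unchanged" rows breaks when
# duplicate rows exist in a copy-on-write rewrite.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
cols = ["id", "name", "age"]

# Rows in the data file removed by the copy-on-write operation.
removed = spark.createDataFrame([(1, "Amy", 20)], cols)
# Rows in the replacement data file: the same row now appears twice.
added = spark.createDataFrame([(1, "Amy", 20), (1, "Amy", 20)], cols)

# A naive anti join on all columns drops both copies, so no insert is emitted
# even though the table actually gained one row.
naive_inserts = added.join(removed, on=cols, how="left_anti")
naive_inserts.count()  # 0 -- the duplicate insert is lost

# The MVP instead emits every removed-file row as a delete and every
# added-file row as an insert, which stays correct under duplicates.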

*Phase 2*

   1. Enable update reconstruction to emit CDC update records (a conceptual
      sketch follows this list).
   2. COW: filter out unchanged rows.
   3. User interface: support the metadata table, which will enable more use
      cases, e.g., the streaming use case.
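
As a rough illustration of what update reconstruction in phase 2 means, here
is a conceptual PySpark sketch. The identifier field, column names, and the
"_change_type"/"_before"/"_after" labels are assumptions for the example, not
the actual design.

# Fold a CDC delete and insert that share the same identifier within one
# snapshot into a single update record with pre- and post-images.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

deletes = spark.createDataFrame([(1, "Amy", 20)], ["id", "name", "age"])
inserts = spark.createDataFrame([(1, "Amy", 21)], ["id", "name", "age"])

updates = (deletes.alias("d")
    .join(inserts.alias("i"), on="id", how="inner")
    .select(
        F.col("id"),
        F.struct(F.col("d.name"), F.col("d.age")).alias("_before"),
        F.struct(F.col("i.name"), F.col("i.age")).alias("_after"),
        F.lit("UPDATE").alias("_change_type")))

updates.show(truncate=False)  # one update row: (1, {Amy, 20}, {Amy, 21})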


Best,

Yufei

`This is not a contribution`

