Posted to dev@iceberg.apache.org by OpenInx <op...@gmail.com> on 2020/12/17 07:03:36 UTC

What's the time to expose iceberg format v2 to end users ?

Hi

I'm writing to align with the community on when to expose format v2 to
end users.

In Iceberg format v2, we've implemented row-level deletes. They are
designed for two use cases:

1. Execute a single query to update or delete many rows. This is a
typical batch update/delete job, suitable for GDPR requests or for
correcting bad data.
2. Write a real-time CDC/UPSERT stream to an Iceberg table, so that the
upper-layer compute engines can analyze the change log within minutes.
This is almost ready in the current master branch for the Flink integration.
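
To make the two use cases concrete, here is a minimal toy sketch in Python.
It is not Iceberg code: the Row class, the live_rows function, and the file
names are hypothetical, and it assumes a simplified model in which a
position delete names a (file, row position) pair and an equality delete
names a key value plus the sequence number at which it was written, so that
it only hides rows written earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    file_path: str  # data file holding the row (hypothetical name)
    pos: int        # ordinal position of the row inside that file
    seq: int        # sequence number of the commit that wrote the row
    user_id: int
    email: str


def live_rows(rows, position_deletes, equality_deletes):
    """Return the rows that survive both kinds of row-level delete.

    position_deletes: set of (file_path, pos) pairs.
    equality_deletes: dict mapping user_id -> sequence number of the delete.
    """
    survivors = []
    for row in rows:
        if (row.file_path, row.pos) in position_deletes:
            continue  # removed by position (typical for batch jobs)
        delete_seq = equality_deletes.get(row.user_id)
        if delete_seq is not None and row.seq < delete_seq:
            continue  # removed by key; only older rows are affected
        survivors.append(row)
    return survivors


rows = [
    Row("data-1.parquet", 0, 1, 1, "a@example.com"),
    Row("data-1.parquet", 1, 1, 2, "b@example.com"),
    Row("data-2.parquet", 0, 1, 3, "c@example.com"),
    # Upsert for user_id=3: the new row is written at sequence number 2 ...
    Row("data-3.parquet", 0, 2, 3, "c-new@example.com"),
]

# Use case 1: a batch job deletes one row at a known position.
position_deletes = {("data-1.parquet", 1)}
# Use case 2: ... together with an equality delete on the key, also at 2.
equality_deletes = {3: 2}

remaining = live_rows(rows, position_deletes, equality_deletes)
remaining_emails = [row.email for row in remaining]
```

In this sketch the upserted row survives because the equality delete only
applies to rows with a strictly lower sequence number than the delete.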


I'm not sure what is still blocking format v2. I'd love to help resolve
the blockers, if there are any.

Thanks.

Re: What's the time to expose iceberg format v2 to end users ?

Posted by "Jun H." <ju...@gmail.com>.

I updated the doc with a few changes related to partition evolution.

Thanks.


Jun


Re: What's the time to expose iceberg format v2 to end users ?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

Thanks, Yan!

To summarize that doc a bit, the main blockers are:
* Finish updating the spec for NaN counters and behavior
* Fix the issue with partition transforms and values before 1970 (#1680)
* Partition evolution: Add lastPartitionFieldId to table metadata and
update docs
* Add order id column to manifest files
* Track the schema of each snapshot

Only the last one is a somewhat large task, but even that should be fairly
quick. I think we can take care of those in the first couple months of 2021
after the 0.11.0 release is out.
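
As a rough illustration of how values before 1970 can trip up a time-based
partition transform, here is a minimal Python sketch. It assumes the
problem comes from integer division that rounds toward zero, which this
thread does not state; the function names are hypothetical and are not
Iceberg APIs.

```python
SECONDS_PER_DAY = 86_400


def days_truncating(epoch_seconds: int) -> int:
    """Day ordinal using division that rounds toward zero (as in Java or C)."""
    quotient = abs(epoch_seconds) // SECONDS_PER_DAY
    return quotient if epoch_seconds >= 0 else -quotient


def days_flooring(epoch_seconds: int) -> int:
    """Day ordinal using floor division, which rounds toward negative infinity."""
    return epoch_seconds // SECONDS_PER_DAY


one_second_before_epoch = -1  # 1969-12-31T23:59:59Z
one_second_after_epoch = 1    # 1970-01-01T00:00:01Z

# With truncation, both instants land on day 0, so two different calendar
# days would share one partition value.
truncated = (days_truncating(one_second_before_epoch),
             days_truncating(one_second_after_epoch))

# With floor division, the earlier instant correctly falls on day -1.
floored = (days_flooring(one_second_before_epoch),
           days_flooring(one_second_after_epoch))
```

Under these assumptions, truncation gives the same day ordinal to an instant
in 1969 and an instant in 1970, while floor division keeps them apart.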

-- 
Ryan Blue
Software Engineer
Netflix

Re: What's the time to expose iceberg format v2 to end users ?

Posted by OpenInx <op...@gmail.com>.

Thanks for the document, Yan. I will take a look and see what I can do.


Re: What's the time to expose iceberg format v2 to end users ?

Posted by Yan Yan <yy...@gmail.com>.

Hi OpenInx,

Thanks for bringing this up. I am currently working on the format v2
blocking tasks and, after speaking with Ryan a while ago, am maintaining a
full list of them with descriptions and current status here
<https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit?usp=sharing>.
The list covers all open issues in the
GitHub milestone <https://github.com/apache/iceberg/milestone/7> plus some
others that people raised during community syncs. It would be great if you
are interested in collaborating or reviewing code!

Everyone, please feel free to let me know or update the doc if you see any
item that is missing or described inaccurately.

Thanks,
Yan
