Posted to dev@iceberg.apache.org by Himanshu Rathore <hi...@zomato.com.INVALID> on 2021/03/23 18:27:11 UTC

When is the next release of Iceberg ?

We are planning to use Flink + Iceberg for syncing MySQL binlogs via Debezium, and it seems some things are dependent on the next release.

Re: When is the next release of Iceberg ?

Posted by Huadong Liu <hu...@gmail.com>.
Hi openinx,

With https://github.com/apache/iceberg/pull/2303 and a potential sequence-number-based fix for https://github.com/apache/iceberg/issues/2308, I don't see a hard blocker to testing out row-level deletions. Please correct me if anything else in https://github.com/apache/iceberg/milestone/4 is a must-have.

Is it possible to separate the Flink + Iceberg CDC changes and row-level deletions into future releases so that the community can have V2 earlier?

Thanks,
Huadong

On 2021/03/24 02:34:23, OpenInx <op...@gmail.com> wrote: 
> Hi Himanshu
> 
> Thanks for the email.  Currently Flink + Iceberg supports writing CDC
> events into an Apache Iceberg table through the Flink DataStream API, and
> Spark/Presto/Hive can read those events in batch jobs.
> 
> But there are still some issues that we have not finished yet:
> 
> 1.  Expose Iceberg format v2 to end users.  The row-level delete feature
> is built on format v2, and there are still some blockers that we need to
> fix (please see the document
> https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit);
> the Iceberg team will need some resources to resolve them.
> 2.  As we know, CDC events depend on Iceberg primary key identification
> (then we could define a mysql_cdc SQL table using a primary key clause).
> Jack Ye has published a PR for this:
> https://github.com/apache/iceberg/pull/2354; I will review it today.
> 3.  The CDC writers will inevitably produce many small files as the
> periodic checkpoints go on, so for a real production environment we must
> provide the ability to rewrite small files into larger files (a
> compaction action).  There are a few PRs that need review:
>        a.  https://github.com/apache/iceberg/pull/2303/files
>        b.  https://github.com/apache/iceberg/pull/2294
>        c.  https://github.com/apache/iceberg/pull/2216
> 
> I think it's better to resolve all of those issues before we put
> production data into Iceberg (syncing MySQL binlogs via Debezium).  I saw
> the last sync notes saying the next release, 0.12.0, would ideally go out
> at the end of this month (
> https://lists.apache.org/x/thread.html/rdb7d1ab221295adec33cf93dcbcac2b9b7b80708b2efd903b7105511@%3Cdev.iceberg.apache.org%3E
> ), but I think that deadline is too tight.  In my mind, if release 0.12.0
> won't expose format v2 to end users, then what are the core features that
> we want to release?  If the features we plan to release are not major
> ones, then how about releasing 0.11.2 instead?
> 
> According to my understanding of the needs of community users, the vast
> majority of Iceberg users have high expectations for format v2. I think
> we may need to raise v2 exposure to a higher priority so that our users
> can run full PoC tests earlier.
> 
> 
> 
> On Wed, Mar 24, 2021 at 3:49 AM Himanshu Rathore
> <hi...@zomato.com.invalid> wrote:
> 
> > We are planning to use Flink + Iceberg for syncing MySQL binlogs via
> > Debezium, and it seems some things are dependent on the next release.
> >
> 

Re: When is the next release of Iceberg ?

Posted by OpenInx <op...@gmail.com>.
Hi Himanshu

If you want to try Flink + Iceberg for syncing MySQL binlogs to an Iceberg
table, you might be interested in these PRs:

1. https://github.com/apache/iceberg/pull/2410
2. https://github.com/apache/iceberg/pull/2303

On Wed, Mar 24, 2021 at 10:34 AM OpenInx <op...@gmail.com> wrote:

> Hi Himanshu
>
> Thanks for the email.  Currently Flink + Iceberg supports writing CDC
> events into an Apache Iceberg table through the Flink DataStream API, and
> Spark/Presto/Hive can read those events in batch jobs.
>
> But there are still some issues that we have not finished yet:
>
> 1.  Expose Iceberg format v2 to end users.  The row-level delete feature
> is built on format v2, and there are still some blockers that we need to
> fix (please see the document
> https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit),
> the Iceberg team will need some resources to resolve them.
> 2.  As we know, CDC events depend on Iceberg primary key identification
> (then we could define a mysql_cdc SQL table using a primary key clause).
> Jack Ye has published a PR for this:
> https://github.com/apache/iceberg/pull/2354; I will review it today.
> 3.  The CDC writers will inevitably produce many small files as the
> periodic checkpoints go on, so for a real production environment we must
> provide the ability to rewrite small files into larger files (a
> compaction action).  There are a few PRs that need review:
>        a.  https://github.com/apache/iceberg/pull/2303/files
>        b.  https://github.com/apache/iceberg/pull/2294
>        c.  https://github.com/apache/iceberg/pull/2216
>
> I think it's better to resolve all of those issues before we put
> production data into Iceberg (syncing MySQL binlogs via Debezium).  I saw
> the last sync notes saying the next release, 0.12.0, would ideally go out
> at the end of this month (
> https://lists.apache.org/x/thread.html/rdb7d1ab221295adec33cf93dcbcac2b9b7b80708b2efd903b7105511@%3Cdev.iceberg.apache.org%3E
> ), but I think that deadline is too tight.  In my mind, if release 0.12.0
> won't expose format v2 to end users, then what are the core features that
> we want to release?  If the features we plan to release are not major
> ones, then how about releasing 0.11.2 instead?
>
> According to my understanding of the needs of community users, the vast
> majority of Iceberg users have high expectations for format v2. I think
> we may need to raise v2 exposure to a higher priority so that our users
> can run full PoC tests earlier.
>
>
>
> On Wed, Mar 24, 2021 at 3:49 AM Himanshu Rathore
> <hi...@zomato.com.invalid> wrote:
>
>> We are planning to use Flink + Iceberg for syncing MySQL binlogs via
>> Debezium, and it seems some things are dependent on the next release.
>>
>

Re: When is the next release of Iceberg ?

Posted by OpenInx <op...@gmail.com>.
Hi Himanshu

Thanks for the email.  Currently Flink + Iceberg supports writing CDC
events into an Apache Iceberg table through the Flink DataStream API, and
Spark/Presto/Hive can read those events in batch jobs.

But there are still some issues that we have not finished yet:

1.  Expose Iceberg format v2 to end users.  The row-level delete feature
is built on format v2, and there are still some blockers that we need to
fix (please see the document
https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit);
the Iceberg team will need some resources to resolve them.
2.  As we know, CDC events depend on Iceberg primary key identification
(then we could define a mysql_cdc SQL table using a primary key clause).
Jack Ye has published a PR for this:
https://github.com/apache/iceberg/pull/2354; I will review it today.
3.  The CDC writers will inevitably produce many small files as the
periodic checkpoints go on, so for a real production environment we must
provide the ability to rewrite small files into larger files (a
compaction action).  There are a few PRs that need review:
       a.  https://github.com/apache/iceberg/pull/2303/files
       b.  https://github.com/apache/iceberg/pull/2294
       c.  https://github.com/apache/iceberg/pull/2216
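For context on point 2 above, the "primary key clause" is the one used when
defining a CDC source table in Flink SQL. A hypothetical sketch follows;
the table name, columns, and connector options are placeholders (not taken
from this thread), and the 'mysql-cdc' connector from the
flink-cdc-connectors project is assumed:

```sql
-- Hypothetical example only: names and options are illustrative.
CREATE TABLE mysql_cdc_orders (
  id BIGINT,
  amount DECIMAL(10, 2),
  -- The primary key clause that CDC event identification relies on.
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql-host',
  'database-name' = 'shop',
  'table-name' = 'orders'
);
```

The primary key here is what lets the sink turn a stream of insert,
update, and delete events into row-level changes on the Iceberg table.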

I think it's better to resolve all of those issues before we put
production data into Iceberg (syncing MySQL binlogs via Debezium).  I saw
the last sync notes saying the next release, 0.12.0, would ideally go out
at the end of this month (
https://lists.apache.org/x/thread.html/rdb7d1ab221295adec33cf93dcbcac2b9b7b80708b2efd903b7105511@%3Cdev.iceberg.apache.org%3E
), but I think that deadline is too tight.  In my mind, if release 0.12.0
won't expose format v2 to end users, then what are the core features that
we want to release?  If the features we plan to release are not major
ones, then how about releasing 0.11.2 instead?

According to my understanding of the needs of community users, the vast
majority of Iceberg users have high expectations for format v2. I think
we may need to raise v2 exposure to a higher priority so that our users
can run full PoC tests earlier.
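To illustrate the small-file concern in point 3 above: each periodic
checkpoint flushes whatever the writers have buffered, so a CDC pipeline
accumulates many tiny files, and a compaction action rewrites them into
fewer, larger ones. The sketch below is a toy greedy bin-packing plan,
not Iceberg's actual rewrite implementation; the file sizes and the
target size are made up for illustration.

```python
# Toy model of compaction planning: group many small data files into
# rewrite groups of roughly a target size. Illustrative only; this is
# not how Iceberg's RewriteDataFiles action is implemented.

def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily pack file sizes (MB) into groups of about target_mb."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        # Close the current group once adding this file would overflow it.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# A dozen small files from successive checkpoints collapse into just a
# couple of larger rewrite groups.
checkpoint_files = [4, 8, 2, 16, 8, 4, 32, 2, 8, 16, 4, 8]
groups = plan_compaction(checkpoint_files, target_mb=64)
print(len(groups), [sum(g) for g in groups])  # → 2 [64, 48]
```

The point is only that without some such rewrite step, read performance
degrades as the file count grows with every checkpoint.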



On Wed, Mar 24, 2021 at 3:49 AM Himanshu Rathore
<hi...@zomato.com.invalid> wrote:

> We are planning to use Flink + Iceberg for syncing MySQL binlogs via
> Debezium, and it seems some things are dependent on the next release.
>