Posted to dev@iceberg.apache.org by Qinhua Yan <Qi...@twosigma.com> on 2021/08/27 18:47:50 UTC

Enhanced JdbcCatalog with Namespace Management

Hi there,


We'd like to share our JdbcCatalog impl with the community and welcome any discussion.

We are aware of the existing JdbcCatalog impl <https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java>; however, it has some feature gaps and doesn't work for our use case. Therefore, we implemented a SQL-database-backed Catalog with the following enhancements.

1.       Namespace management and configuration

*       Each namespace can be backed by a different S3 bucket. This allows fine-grained access control at the namespace level.

*       At namespace creation time, users can choose to either 1) use a pre-existing bucket or 2) let the Catalog create a new bucket.

*       Isolate logical TableIdentifiers from physical S3 locations.

*       Support renaming a table within the same namespace without touching S3.

2.       Support various kinds of databases

*       Use Jooq <https://www.jooq.org/> to connect to the database and to ensure SQL semantics.

*       Easy to support different kinds of SQL without touching the core Catalog code.

*       Provide database initialization scripts for Postgres.
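
To illustrate the "rename without touching S3" point above, here is a minimal sketch, not the implementation described in this thread. It assumes the catalog keeps one row per table that maps the logical (namespace, table) pair to a physical location; the class name, method names, and the example S3 path are all hypothetical, and an in-memory map stands in for the SQL table.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

/** In-memory stand-in for a catalog's SQL "tables" table; every name here is hypothetical. */
final class CatalogRenameSketch {
    // Logical identifier (namespace, table) -> physical metadata location.
    private final Map<List<String>, String> rows = new HashMap<>();

    void register(String namespace, String table, String location) {
        if (rows.putIfAbsent(List.of(namespace, table), location) != null) {
            throw new IllegalStateException("Table already exists: " + namespace + "." + table);
        }
    }

    /** Renames by rewriting the catalog key only; the stored location is carried over unchanged. */
    void rename(String namespace, String from, String to) {
        List<String> fromKey = List.of(namespace, from);
        List<String> toKey = List.of(namespace, to);
        if (!rows.containsKey(fromKey)) {
            throw new NoSuchElementException("No such table: " + namespace + "." + from);
        }
        if (rows.containsKey(toKey)) {
            throw new IllegalStateException("Table already exists: " + namespace + "." + to);
        }
        rows.put(toKey, rows.remove(fromKey));
    }

    /** Returns the physical location, or null if the table is not registered. */
    String locationOf(String namespace, String table) {
        return rows.get(List.of(namespace, table));
    }

    public static void main(String[] args) {
        CatalogRenameSketch catalog = new CatalogRenameSketch();
        catalog.register("mynamespace", "mytable", "s3://example-bucket/tables/0001");
        catalog.rename("mynamespace", "mytable", "renamed");
        System.out.println(catalog.locationOf("mynamespace", "renamed"));
    }
}
```

In a SQL-backed catalog the same idea would presumably be a single UPDATE of the table-name column, with no object-store calls involved.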



This Catalog implementation can be easily extended to support some advanced features such as undelete tables and namespace-backed-by-multiple-backends.



Any comments and discussion are welcome!


Thank you!
Qinhua Yan

RE: Enhanced JdbcCatalog with Namespace Management

Posted by Qinhua Yan <Qi...@twosigma.com>.

I’m really glad to see so much interest! ☺

I can’t wait to create the PR, but honestly we have only just gotten the first working Catalog, and I wanted to learn what people think first. I’ll post in this thread when the PR is ready.

Before that, let me try to answer some of the questions.

·       We created a separate SQL table for namespaces to make lookups and configuration easier and faster.

o   Each namespace has a location URI that points to an S3 bucket, which is the default location for tables created under that namespace.

o   It is possible for a namespace to have multiple backends.

·       Our users are very familiar with namespace-level permission management. Most namespaces are restricted (for both reads and writes) to certain groups of people.

o   Users with access to the Catalog DB are catalog owners. They are not necessarily, and are often not, the data owners. For example, right now the major client of our Catalog is a data-access service. It is the owner of the Catalog but not the Tables.

o   Technically, our Catalog supports rename-across-namespaces in the same way that it supports rename-within-namespace. However, for us this is very rare as explained above.

o   For the same reason, our users are familiar with URI identifiers like “iceberg://mynamespace/mytable”. What I meant by isolating the logical table name from the physical location is that users will never see or need to understand “iceberg://mynamespace/mytable-uuid16”. I understand that this feature is common to the existing Catalog impls.

·       SQL compatibility

o   (This almost looks like an ad, but) Jooq <https://www.jooq.org/> is a type-safe, generic SQL database connector that supports multiple SQL dialects. In other words, the same Java code implementing a SQL query can work with different types of databases.

o   That said, I totally understand your concern about bringing in an additional dependency, and we are OK with switching to a plain JDBC driver if you want to keep the core code independent.

o   Agreed that SQL database initialization can be added to the docs.

·       Features that are not ready yet but that we want to work on next (and that will involve the Catalog) include:

o   Undelete,

o   and, with logic very similar to undelete, registering a pre-existing table in the Catalog.
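
The logical identifier form mentioned above, “iceberg://mynamespace/mytable”, can be split into its namespace and table parts with the standard java.net.URI class. This is only a sketch under assumptions: the class and method names are hypothetical, and it assumes a single-level namespace in the authority position and a single path segment for the table, which the thread does not spell out.

```java
import java.net.URI;

/** Hypothetical helper that splits a logical URI of the form iceberg://<namespace>/<table>. */
final class LogicalUriSketch {
    /** Returns a two-element array: {namespace, table}. */
    static String[] parse(String logicalUri) {
        URI uri = URI.create(logicalUri);
        if (!"iceberg".equals(uri.getScheme())) {
            throw new IllegalArgumentException("Expected the iceberg:// scheme: " + logicalUri);
        }
        // getAuthority() is used rather than getHost() so that names which are
        // not valid host names (for example, names with underscores) still work.
        String namespace = uri.getAuthority();
        String path = uri.getPath();
        boolean singleSegment = path != null && path.length() > 1 && path.indexOf('/', 1) < 0;
        if (namespace == null || !singleSegment) {
            throw new IllegalArgumentException("Expected iceberg://<namespace>/<table>: " + logicalUri);
        }
        return new String[] {namespace, path.substring(1)};
    }

    public static void main(String[] args) {
        String[] parts = parse("iceberg://mynamespace/mytable");
        System.out.println(parts[0] + " " + parts[1]);
    }
}
```

The physical location never appears in this identifier; it would be looked up from the catalog using the parsed pair.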

Thanks,
Qinhua Yan

RE: Enhanced JdbcCatalog with Namespace Management

Posted by Qinhua Yan <Qi...@twosigma.com>.

Hi there,
Please find the POC pull request here:
https://github.com/apache/iceberg/pull/3177

Thanks,
Qinhua Yan

Re: Enhanced JdbcCatalog with Namespace Management

Posted by Jack Ye <ye...@gmail.com>.

Hi Qinhua,

+1 for what Ryan says, it would be great to have a PR to analyze the
features you list in detail. I have the following questions and comments:

> Namespace management and configuration
This is something we decided to implement in a second iteration when people
have a need, so if you want to add the namespace feature, it would be great
to add it directly to the existing JdbcCatalog. I can review in more detail
when you have the PR out.

> Each namespace can be backed by a different S3 bucket. This allows fine
> grained access control at the namespace level.
This is a common feature across all catalog implementations that support
namespaces. Typically a LocationUri is stored as a namespace property, which
can be used to override the default table location in that namespace.
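
A rough sketch of the override described above, under stated assumptions: the property key "location", the class name, and the warehouse/namespace/table fallback layout are illustrative choices and are not taken from any specific catalog implementation.

```java
import java.util.Map;

/** Hypothetical resolution of a table's default location from namespace properties. */
final class DefaultLocationSketch {
    // Assumed property key; real catalogs may use a different name.
    static final String LOCATION_PROPERTY = "location";

    static String defaultTableLocation(
            String warehouse, String namespace, Map<String, String> namespaceProperties, String table) {
        String base = namespaceProperties.get(LOCATION_PROPERTY);
        if (base == null) {
            // No override on the namespace: fall back to <warehouse>/<namespace>.
            base = stripTrailingSlash(warehouse) + "/" + namespace;
        }
        return stripTrailingSlash(base) + "/" + table;
    }

    private static String stripTrailingSlash(String path) {
        return path.endsWith("/") ? path.substring(0, path.length() - 1) : path;
    }

    public static void main(String[] args) {
        // Namespace with its own bucket configured as a property.
        System.out.println(defaultTableLocation(
                "s3://warehouse/", "db", Map.of(LOCATION_PROPERTY, "s3://team-a-bucket/db"), "events"));
        // Namespace without an override.
        System.out.println(defaultTableLocation("s3://warehouse/", "db", Map.of(), "events"));
    }
}
```

With this shape, a per-namespace bucket is just one possible value of the property rather than a resource the catalog has to create.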

> At namespace creation time, users can choose either 1) use a pre-existing
> bucket; 2) let the Catalog create a new bucket.
I think we need to make this more generic, likely with the mechanism I
described that is used for all other catalog implementations. A bucket is a
resource with a maximum cap, so it would not scale well for multi-tenant
use cases where users have an unbounded number of logical namespaces.

> Isolate logical TableIdentifiers from physical S3 locations.
> Support rename table within the same namespace without touching S3.
Iceberg tables have UUIDs, so I think these are achievable directly based on
the Iceberg catalog and storage design. We should also be able to rename
tables across namespaces. Could you describe in more detail what feature is
added here?

> Support various kinds of databases
Is there anything that the current implementation does not support? I don't
think it uses any SQL dialect that is incompatible across databases, but
maybe I overlooked something here.

> Use Jooq <https://www.jooq.org/> to connect to the database and to ensure
> SQL semantics.
I think we should evaluate the differences between this library and the
current implementation. I checked that the license of the library is fine.
Could you describe why this is a better choice than the existing
implementation?
We typically do not want to have too many third-party dependencies. Given
that JdbcCatalog is in the core library, we should be very careful when
adding new dependencies.

> Provide database initialization scripts for Postgres.
I think infrastructure setup like database initialization is out of scope
for the Iceberg library. We can add this as a part of the JDBC documentation.

Best,
Jack Ye







Re: Enhanced JdbcCatalog with Namespace Management

Posted by Ryan Blue <bl...@tabular.io>.

Qinhua, thanks for sharing this. It sounds great to add more features to
the JDBC catalog.

Could you share a link to the implementation or a PR? I have lots more
questions like how you implemented namespaces, but those can probably be
answered by looking at the code if you're able to share it.

Thanks!

Ryan


-- 
Ryan Blue
Tabular