Posted to dev@iceberg.apache.org by Mayur Srivastava <Ma...@twosigma.com> on 2021/05/11 18:46:02 UTC

Iceberg catalog questions

Hi,

I'm looking to use or implement a PostgreSQL-based Iceberg catalog. I'm wondering whether one already exists, and I also have a few questions. I would really appreciate any help with them.

1.      Does Iceberg have a catalog that is compatible with PostgreSQL (or any storage backend that is compatible with PostgreSQL)?

a.      If there are similar implementations, could someone share their experience with the database schema used for the catalog? E.g. does a namespace map to a database in the catalog backend?

b.      Is there an existing abstract base class that I can use to implement the catalog that talks to PostgreSQL?

2.      Mapping catalog namespaces to S3 buckets: does someone have a recommendation for managing catalog namespaces alongside AWS S3 (or GCS) buckets? For example, when a top-level namespace is created in the catalog, do users map it to a bucket or to a sub-directory structure on S3? (This may be useful for applying similar access control to both the catalog namespace and the S3 bucket.)

3.      Table access permission management: since metadata is stored in two separate systems (table metadata in S3 and namespace/table location in catalog), how are table access permissions kept in sync in these storage systems? E.g. if a catalog is used with GCS, how are the namespace/bucket/table access permissions kept in sync?

4.      Undeleting or recovering a dropped table: does the catalog support an undelete operation? If the underlying S3 data is not purged, can the catalog be used to recover the dropped table?



Thanks,

Mayur


RE: Iceberg catalog questions

Posted by Mayur Srivastava <Ma...@twosigma.com>.
Thanks Jack and Yufei,

I’ll take a look at the PR. I hope it can be merged soon.

I agree, it looks like I’ll have to take care of access control myself.

For a catalog-level table undelete operation, I can enqueue deletions in a separate table where they can live for a few days, so that a dropped table can be recovered if needed. I’m not sure if a feature like this is useful for others.
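
A minimal sketch of this deletion-queue idea, in Python, with an in-memory SQLite database standing in for PostgreSQL. The table names (tables, dropped_tables), the columns, and the three-day retention window are all assumptions made for illustration; none of this is an Iceberg API.

```python
import sqlite3
import time

RETENTION_SECONDS = 3 * 24 * 3600  # hypothetical retention window: three days


def open_catalog():
    # In-memory SQLite stands in for PostgreSQL; the schema is made up for this sketch.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tables (namespace TEXT, name TEXT, metadata_location TEXT,"
        " PRIMARY KEY (namespace, name))"
    )
    conn.execute(
        "CREATE TABLE dropped_tables (namespace TEXT, name TEXT,"
        " metadata_location TEXT, dropped_at REAL)"
    )
    return conn


def drop_table(conn, namespace, name, now=None):
    """Move the table pointer into the deletion queue instead of discarding it."""
    now = time.time() if now is None else now
    with conn:  # one transaction: both statements apply, or neither does
        row = conn.execute(
            "SELECT metadata_location FROM tables WHERE namespace = ? AND name = ?",
            (namespace, name),
        ).fetchone()
        if row is None:
            return False
        conn.execute(
            "INSERT INTO dropped_tables VALUES (?, ?, ?, ?)",
            (namespace, name, row[0], now),
        )
        conn.execute(
            "DELETE FROM tables WHERE namespace = ? AND name = ?", (namespace, name)
        )
    return True


def undrop_table(conn, namespace, name, now=None):
    """Restore the most recently dropped pointer if it is still within retention."""
    now = time.time() if now is None else now
    with conn:
        row = conn.execute(
            "SELECT rowid, metadata_location FROM dropped_tables"
            " WHERE namespace = ? AND name = ? AND dropped_at >= ?"
            " ORDER BY dropped_at DESC LIMIT 1",
            (namespace, name, now - RETENTION_SECONDS),
        ).fetchone()
        if row is None:
            return False
        conn.execute("INSERT INTO tables VALUES (?, ?, ?)", (namespace, name, row[1]))
        conn.execute("DELETE FROM dropped_tables WHERE rowid = ?", (row[0],))
    return True
```

Only the catalog pointer is queued here. Actually purging the underlying files once the retention window has passed would need a separate job, which this sketch leaves out.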

I’m not considering Hive metastore at the moment, but will take a look at it to check how it manages the catalog.

Thanks,
Mayur

From: Jack Ye <ye...@gmail.com>
Sent: Tuesday, May 11, 2021 3:14 PM
To: dev@iceberg.apache.org
Subject: Re: Iceberg catalog questions

For your subsequent questions:

2. Mapping the namespace name to the file path is only a convention, and it can be overridden at both the namespace and the table level. The table root path can be set to any location, and we actually recommend doing so for cloud storage use cases to reduce throttling.

3. Access control has to be handled across systems. For example, in the AWS Glue + S3 use case, the caller has to have permission to access both Glue and S3 with the correct IAM resource permissions. The permission control capability really depends on the platform you are operating on. It is a bit tricky for a relational database, where you basically have to manage row-level access control, but it is technically achievable.

4. The behavior varies among catalog implementations. In most implementations, CatalogUtil.dropTableData is called to clean up files when purge is enabled. In that case, it cleans up the metadata file, the manifest lists, the manifests, and the data files. That means if purge is not enabled, those files are still there, and you can recover the table if you can rebuild the table pointer in the catalog. But this is a manual action; there is no Iceberg API support for it. In addition, if your storage has a file retention feature and you can recover the files, you can recover the dropped table version, but that is a storage-level feature rather than an Iceberg feature.

-Jack


On Tue, May 11, 2021 at 11:55 AM Jack Ye <ye...@gmail.com> wrote:
Yes, there is one, but unfortunately it lost attention after some time: https://github.com/apache/iceberg/pull/1870

I think the PR is close to being merged after quite a few rounds of review already; we should add it to the 0.12 milestone.

-Jack


Re: Iceberg catalog questions

Posted by Jack Ye <ye...@gmail.com>.
For your subsequent questions:

2. Mapping the namespace name to the file path is only a convention, and
it can be overridden at both the namespace and the table level. The table
root path can be set to any location, and we actually recommend doing so
for cloud storage use cases to reduce throttling.

3. Access control has to be handled across systems. For example, in the AWS
Glue + S3 use case, the caller has to have permission to access both Glue
and S3 with the correct IAM resource permissions. The permission control
capability really depends on the platform you are operating on. It is a bit
tricky for a relational database, where you basically have to manage
row-level access control, but it is technically achievable.

4. The behavior varies among catalog implementations. In most
implementations, CatalogUtil.dropTableData is called to clean up files when
purge is enabled. In that case, it cleans up the metadata file, the
manifest lists, the manifests, and the data files. That means if purge is
not enabled, those files are still there, and you can recover the table if
you can rebuild the table pointer in the catalog. But this is a manual
action; there is no Iceberg API support for it. In addition, if your
storage has a file retention feature and you can recover the files, you can
recover the dropped table version, but that is a storage-level feature
rather than an Iceberg feature.
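
To illustrate point 2, here is a rough Python sketch of how a catalog might resolve a table's root path. The function name, its parameters, and the precedence order (table-level location first, then namespace-level location, then the warehouse/namespace/table convention) are assumptions for illustration only, not the behavior of any particular Iceberg catalog.

```python
def resolve_table_location(warehouse, namespace, table,
                           namespace_locations=None, table_locations=None):
    """Return the root path for a table.

    Assumed precedence for this sketch: an explicit table location, then an
    explicit namespace location, then the warehouse/namespace/table convention.
    """
    namespace_locations = namespace_locations or {}
    table_locations = table_locations or {}
    ns_key = tuple(namespace)

    # Table-level override: the table can live anywhere, even in another bucket.
    if (ns_key, table) in table_locations:
        return table_locations[(ns_key, table)].rstrip("/")

    # Namespace-level override: tables go under the namespace's custom root.
    if ns_key in namespace_locations:
        return f"{namespace_locations[ns_key].rstrip('/')}/{table}"

    # Default convention: <warehouse>/<namespace levels...>/<table>.
    return "/".join([warehouse.rstrip("/"), *namespace, table])
```

With no overrides, resolve_table_location("s3://wh", ["db"], "t") yields "s3://wh/db/t". Passing a table-level entry pointing at a different bucket returns that path unchanged, which is the kind of customization that spreads tables across prefixes.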

-Jack


Re: Iceberg catalog questions

Posted by Jack Ye <ye...@gmail.com>.
Yes, there is one, but unfortunately it lost attention after some time:
https://github.com/apache/iceberg/pull/1870

I think the PR is close to being merged after quite a few rounds of review
already; we should add it to the 0.12 milestone.

-Jack


Re: Iceberg catalog questions

Posted by Yufei Gu <fl...@gmail.com>.
Hi Mayur,

Did you try Hive Metastore in local mode? In that case, PostgreSQL serves
as the catalog DB and you can use all of HMS's functionality. However, you
still need to handle table access permissions (your third point) yourself.

Best,

Yufei

`This is not a contribution`

