Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/01 23:11:36 UTC

[GitHub] [iceberg] ericlgoodman commented on pull request #5331: WIP: Adding support for Delta to Iceberg migration

ericlgoodman commented on PR #5331:
URL: https://github.com/apache/iceberg/pull/5331#issuecomment-1201826444

   Adding here my primary concern with this PR - and, more generally, a concern going forward with using multiple table formats such as Delta Lake and Iceberg from Spark.
   
   Spark resolves a table through whichever catalog is registered under the first part of the table's identifier. Only one catalog can be registered under a given name, and different catalogs have different capabilities. For example, the `DeltaCatalog` can read Delta Lake and generic Hive tables, and the `SparkSessionCatalog` can read Iceberg and Hive tables.
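   
   For context, a catalog name is wired to an implementation through Spark's `spark.sql.catalog.<name>` configuration. A minimal sketch of that wiring (the catalog name `my_iceberg` and the table `db.events` are arbitrary and only for illustration; the class name and `type` option follow Iceberg's documented Spark configuration):
   
   ```java
   import org.apache.spark.sql.SparkSession;
   
   // Register an Iceberg catalog under the (arbitrary) name "my_iceberg".
   SparkSession spark = SparkSession.builder()
       .config("spark.sql.catalog.my_iceberg", "org.apache.iceberg.spark.SparkCatalog")
       .config("spark.sql.catalog.my_iceberg.type", "hive")
       .getOrCreate();
   
   // The first part of the identifier selects the catalog that resolves the table.
   spark.sql("SELECT * FROM my_iceberg.db.events").show();
   ```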
   
   In theory, to read multiple table types in one Spark session, a user would register a `DeltaCatalog` at, say, `delta`, and the `SparkSessionCatalog` at `iceberg`. All their Delta Lake tables would then be located at `delta.my_delta_database.my_delta_lake_table` and all their Iceberg tables at `iceberg.my_iceberg_database.my_iceberg_table`. Unfortunately, this doesn't work out of the box. Both of these catalog implementations are designed to be used by overriding the default Spark catalog, which is located at `spark_catalog`. `CatalogExtension`, from which `DeltaCatalog` and `SparkSessionCatalog` both inherit, contains a method `setDelegateCatalog(CatalogPlugin delegate)`. As its Javadoc reads:
   
   ```java
    /**
      * This will be called only once by Spark to pass in the Spark built-in session catalog, after
      * {@link #initialize(String, CaseInsensitiveStringMap)} is called.
      */
     void setDelegateCatalog(CatalogPlugin delegate);
   ```
   
   A user can fix this by manually calling this method during Spark setup, passing in the built-in session catalog as the delegate. But most users are presumably not doing this, and some may not even be able to, depending on their service provider and how much of the setup/configuration is abstracted away from them.
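   
   A very rough sketch of that manual workaround, assuming the Delta catalog was registered under the name `delta` as above. It leans on Spark internals (`SessionState#catalogManager`), so the exact accessors may vary between Spark versions; treat it as illustrative rather than a supported API:
   
   ```java
   import org.apache.spark.sql.SparkSession;
   import org.apache.spark.sql.connector.catalog.CatalogExtension;
   import org.apache.spark.sql.connector.catalog.CatalogPlugin;
   
   SparkSession spark = SparkSession.builder().getOrCreate();
   
   // The CatalogExtension registered under "delta". It normally expects to be
   // installed at spark_catalog, so Spark never calls setDelegateCatalog on it.
   CatalogExtension delta =
       (CatalogExtension) spark.sessionState().catalogManager().catalog("delta");
   
   // Whatever catalog currently serves spark_catalog (by default the built-in
   // session catalog) becomes the delegate.
   CatalogPlugin sessionCatalog =
       spark.sessionState().catalogManager().catalog("spark_catalog");
   
   delta.setDelegateCatalog(sessionCatalog);
   ```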
   
   This basically means that, today, users don't have a simple way to use one Spark session to read from, or migrate between, different table types. Making that work might be worth tackling first, as users may find that a Delta/Iceberg/Hudi table makes sense for them in one context while another format is preferable in another.
   
   When it comes to migration, there are basically two options:
   1. Create a more abstract catalog implementation that can read Iceberg/Delta/Hudi/Hive tables dynamically, similar to what the Trino Hive connector does: it inspects the table's properties and decides at runtime whether to redirect to another connector. A Spark catalog could likewise delegate to a format-specific catalog when it sees certain format-specific table properties (a rough sketch of that check follows below this list).
   2. Provide an easier way for users to avoid overriding the default catalog for these format-specific catalog implementations. If the Delta catalog were registered at `delta` and the Iceberg catalog at `iceberg`, users could keep their different table types in different catalogs, and migration could take an optional parameter naming the desired target catalog.
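   
   To make option 1 concrete, here's a rough sketch of the routing check such a delegating catalog might perform. The class name and the exact properties checked are illustrative assumptions, not an existing API - Iceberg tables registered in a Hive metastore are typically tagged with `table_type=ICEBERG`, and Spark generally exposes the source format as a `provider` table property - and a real implementation would need to cover more cases:
   
   ```java
   import java.util.Map;
   
   import org.apache.spark.sql.connector.catalog.Table;
   
   // Hypothetical routing helper for a delegating catalog (sketch only): inspect
   // the loaded table's properties and decide which format-specific catalog should
   // actually handle it, similar to Trino's table redirection.
   class TableFormatRouter {
   
     /** Name of the catalog to delegate to, e.g. "iceberg", "delta", or the default. */
     static String targetCatalogFor(Table table) {
       Map<String, String> props = table.properties();
       String provider = props.getOrDefault("provider", "");
       if ("iceberg".equalsIgnoreCase(provider)
           || "ICEBERG".equalsIgnoreCase(props.get("table_type"))) {
         return "iceberg";
       }
       if ("delta".equalsIgnoreCase(provider)) {
         return "delta";
       }
       // Plain Hive/parquet tables stay with the regular session catalog.
       return "spark_catalog";
     }
   }
   ```
   
   For option 2, the `spark.sql.catalog.<name>` registration shown earlier is essentially the wiring that's needed (e.g. `spark.sql.catalog.delta`, `spark.sql.catalog.iceberg`); the missing piece is letting those catalogs work there without assuming they were installed at `spark_catalog`.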


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

