Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/01 07:29:00 UTC

[GitHub] [iceberg] jackye1995 commented on issue #3044: Unable to use GlueCatalog in flink environments without hadoop

jackye1995 commented on issue #3044:
URL: https://github.com/apache/iceberg/issues/3044#issuecomment-910015259


   
   > At present, I think adding to the known catalog-types might be the best path forward to resolve the issue more immediately. 
   
   That is actually something we did not want to do, which is why the aws module is not part of the flink module's dependencies. Adding that dependency is likely good for AWS, but as more catalog implementations are added, it becomes unmanageable to carry that many dependencies.
   
   > I believe that one potential root issue is that FileIO has leaked out from TableOperations into catalog implementations like GlueCatalog.
   
   This was also discussed when implementing the catalog. Having a default FileIO definition at the catalog level is a feature; that's why `CatalogProperties.FILE_IO_IMPL` was created. Initializing the default FileIO in the catalog allows reusing the same FileIO singleton instead of creating many different instances.
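
   As a rough sketch of what that property gives you (assumptions: the iceberg-aws module is on the classpath, AWS credentials/region come from the environment, and the warehouse path is only a placeholder):

   ```java
   // Sketch only: configure a catalog-level default FileIO via
   // CatalogProperties.FILE_IO_IMPL instead of relying on each TableOperations.
   import java.util.HashMap;
   import java.util.Map;

   import org.apache.iceberg.CatalogProperties;
   import org.apache.iceberg.aws.glue.GlueCatalog;

   public class GlueCatalogFileIoSketch {
     public static void main(String[] args) {
       Map<String, String> props = new HashMap<>();
       props.put(CatalogProperties.WAREHOUSE_LOCATION, "s3://example-bucket/warehouse"); // placeholder
       props.put(CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.aws.s3.S3FileIO");

       // The catalog creates the configured FileIO once and reuses that single
       // instance for every table it loads.
       GlueCatalog catalog = new GlueCatalog();
       catalog.initialize("glue", props);
     }
   }
   ```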
   
   I don't think the problem is solved even if you hide the `FileIO` creation in `TableOperations`: the `FileIO` loading path still checks for the Hadoop `Configurable` interface, so it does not make much difference where the creation happens.
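
   To illustrate the coupling, the reflective loading looks roughly like this (an illustration of the shape only, not the actual `CatalogUtil.loadFileIO` code):

   ```java
   // Sketch: wherever the FileIO is created (catalog or TableOperations), the
   // loader still probes for Hadoop's Configurable interface, so the Hadoop
   // classes stay on the code path either way.
   import java.util.Map;

   import org.apache.hadoop.conf.Configurable;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.iceberg.io.FileIO;

   public class FileIoLoaderSketch {
     static FileIO loadFileIO(String impl, Map<String, String> properties, Object hadoopConf) {
       try {
         FileIO io = (FileIO) Class.forName(impl).getDeclaredConstructor().newInstance();
         // The Hadoop coupling: if the FileIO is Configurable, a Configuration
         // has to be available to hand over, regardless of where this code lives.
         if (io instanceof Configurable && hadoopConf instanceof Configuration) {
           ((Configurable) io).setConf((Configuration) hadoopConf);
         }
         io.initialize(properties);
         return io;
       } catch (ReflectiveOperationException e) {
         throw new IllegalArgumentException("Cannot instantiate FileIO: " + impl, e);
       }
     }
   }
   ```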
   
   > Additional updates to the FlinkCatalogFactory are still needed on top of these changes in order to fully remove the hadoop dependency
   
   Yes, you are right: we can fully remove the dependency in `GlueCatalog`, but the issue is more on the engine side, which essentially requires that dependency. The Flink catalog entry point `CatalogFactory.createCatalog(String name, Map<String, String> properties)` calls `createCatalog(name, properties, clusterHadoopConf())` directly, which initializes a Hadoop configuration, and the serialized catalog loader `CustomCatalogLoader` has a `SerializableConfiguration` field, so you are guaranteed to get a serialization exception in Flink if the Hadoop configuration is missing. This looks like a deeper issue than just a fix on the catalog side.
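
   Roughly, the shape of the problem on the Flink side looks like this (illustration only, not the actual `FlinkCatalogFactory` source):

   ```java
   // Sketch: the factory entry point pulls in a cluster-wide Hadoop
   // Configuration before the catalog properties are even inspected, and the
   // serializable catalog loader then carries that Configuration as a field.
   import java.util.Map;

   import org.apache.flink.table.catalog.Catalog;
   import org.apache.hadoop.conf.Configuration;

   public class HadoopCoupledFactorySketch {

     public Catalog createCatalog(String name, Map<String, String> properties) {
       // Hadoop is required here no matter which Iceberg catalog impl the
       // properties ask for.
       return createCatalog(name, properties, clusterHadoopConf());
     }

     private Catalog createCatalog(String name, Map<String, String> properties, Configuration hadoopConf) {
       // ... wraps hadoopConf into a SerializableConfiguration field of the
       // catalog loader, so serializing the loader to task managers fails
       // without Hadoop on the classpath.
       throw new UnsupportedOperationException("sketch only");
     }

     private static Configuration clusterHadoopConf() {
       // stands in for however the real factory obtains the cluster Hadoop configuration
       return new Configuration();
     }
   }
   ```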
   
   I think we should tackle this on the engine side first, and then see what the best way forward is for catalog implementations. This seems like a valid ask for a Flink catalog factory improvement.
   
   Meanwhile, although a bit hacky, why not just add two empty classes, `Configuration` and `Configurable`, to your classpath? That removes the need for the entire Hadoop jar.
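
   Something like the following stubs is what I mean (sketch only: the real `org.apache.hadoop.conf.Configurable` declares `setConf`/`getConf`, so the stub mirrors those signatures, and whether bare stubs are enough depends on what your code path actually calls):

   ```java
   // File: org/apache/hadoop/conf/Configuration.java
   // Stand-in for Hadoop's Configuration, present only so class loading succeeds.
   package org.apache.hadoop.conf;

   public class Configuration {
   }
   ```

   ```java
   // File: org/apache/hadoop/conf/Configurable.java
   // Stand-in for Hadoop's Configurable with the same method signatures.
   package org.apache.hadoop.conf;

   public interface Configurable {
     void setConf(Configuration conf);

     Configuration getConf();
   }
   ```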
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


