You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/29 01:09:05 UTC

[GitHub] [iceberg] rdblue commented on pull request #2129: Hive: Configure catalog type on table level.

rdblue commented on pull request #2129:
URL: https://github.com/apache/iceberg/pull/2129#issuecomment-769503071


   This made me think a lot about what the behavior should be. Before diving further into the review, I think we should agree on that.
   
   We have 3 cases for loading tables:
   * Hadoop table identified by its location
   * Hive catalog table tracked in the current metastore
   * A table in any catalog where the metastore is a reference
   
   Ideally, a Hive table that is owned by the Hive metastore would require no additional property to point it to a catalog, because the metastore is unnamed and is basically global. A table in another catalog would have a table property, `iceberg.catalog`, that has the other catalog's name. That leaves Hadoop tables and I would use a property like `iceberg.type=hadoop` to signal that the table location should be used to load it. I would not use a reserved catalog name to identify Hadoop tables to avoid colliding with user catalog names (e.g., "hadoop"), but we could make it reasonably unlikely if we wanted to use `iceberg.catalog=location_based_table` or something.
   
   We also have existing behavior:
   * `iceberg.mr.catalog` is not defined: load all Iceberg tables as Hadoop tables using the location
   * `iceberg.mr.catalog=hive`: load all Iceberg tables using the table name from a HiveCatalog for the metastore URI
   * `iceberg.mr.catalog=hadoop`: load all Iceberg tables using the table name from a HadoopCatalog for the location stored in `iceberg.mr.catalog.hadoop.warehouse.location`
   * `iceberg.mr.catalog.loader.class`: when defined, load all Iceberg tables using the catalog produced by the loader class. This is primarily for direct use of the `InputFormat`, not for Hive.
   
   Unfortunately, the current behavior when `iceberg.mr.catalog` is not set and the table has no catalog set is to load the table's location. The "ideal" behavior I outlined above would load the table from an anonymous `HiveCatalog` created for the metastore URI. I don't really like the option of adding more table-level config if we can avoid it.
   
   Here's what I propose:
   * If a table has `iceberg.catalog` or `iceberg.type=hadoop`, then use the ideal behavior (no conflict with existing)
   * If a table does not have those properties, fall back to the value of `iceberg.mr.catalog`:
       * When not defined: load the Iceberg table from the metastore by name
       * `location`: load the Iceberg table as a Hadoop table by location
       * `hive`: use the existing behavior for backward-compatibility
       * `hadoop`: use the existing behavior for backward-compatibility
   
   I think this is a good compromise for behavior. If someone wants the default to be location-based table loading, then all they need to change is to set `iceberg.mr.catalog=location`. The other cases are unchanged, and the default is to load tables from the metastore (as expected). While it is a behavior change, I think it is a reasonable one.
   
   Miscellaneous:
   * I didn't mention a property to control table identifier above, like `iceberg.identifier`, but I think that would make sense. I would not support it for tables tracked in the metastore, though. It does not make sense to have a pointer to another table owned by the metastore.
   * Because the loader class is intended for the `InputFormat`, I think we should handle it in Hive by making sure it is not set or by using a different load method that ignores it. There should be no need to provide backward compatibility for this and we haven't documented it anywhere.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org