Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/07 00:41:24 UTC

[GitHub] [iceberg] openinx commented on a change in pull request #2544: Doc: refactor Hive documentation with catalog loading examples

openinx commented on a change in pull request #2544:
URL: https://github.com/apache/iceberg/pull/2544#discussion_r627850521



##########
File path: site/docs/hive.md
##########
@@ -17,117 +17,324 @@
 
 # Hive
 
-## Hive read support
-Iceberg supports the reading of Iceberg tables from [Hive](https://hive.apache.org) by using a [StorageHandler](https://cwiki.apache.org/confluence/display/Hive/StorageHandlers). Please note that only Hive 2.x versions are currently supported.
+Iceberg supports reading and writing Iceberg tables through [Hive](https://hive.apache.org) by using a [StorageHandler](https://cwiki.apache.org/confluence/display/Hive/StorageHandlers).
+Here is the current compatibility matrix for Iceberg Hive support: 
 
-### Table creation
-This section explains the various steps needed in order to overlay a Hive table "on top of" an existing Iceberg table. Iceberg tables are created using either a [`Catalog`](./javadoc/master/index.html?org/apache/iceberg/catalog/Catalog.html) or an implementation of the [`Tables`](./javadoc/master/index.html?org/apache/iceberg/Tables.html) interface and Hive needs to be configured accordingly to read data from these different types of table.
+| Feature                  | Hive 2.x               | Hive 3.1.2             |
+| ------------------------ | ---------------------- | ---------------------- |
+| CREATE EXTERNAL TABLE    | ✔️                     | ✔️                     |
+| CREATE TABLE             | ✔️                     | ✔️                     |
+| DROP TABLE               | ✔️                     | ✔️                     |
+| SELECT                   | ✔️ (MapReduce and Tez) | ✔️ (MapReduce and Tez) |
+| INSERT INTO              | ✔️ (MapReduce only)    | ✔️ (MapReduce only)    |
 
-#### Add the Iceberg Hive Runtime jar file to the Hive classpath
-Regardless of the table type, the `HiveIcebergStorageHandler` and supporting classes need to be made available on Hive's classpath. These are provided by the `iceberg-hive-runtime` jar file. For example, if using the Hive shell, this can be achieved by issuing a statement like so:
-```sql
+## Enabling Iceberg support in Hive
+
+### Loading runtime jar
+
+To enable Iceberg support in Hive, the `HiveIcebergStorageHandler` and supporting classes need to be made available on Hive's classpath. 
+These are provided by the `iceberg-hive-runtime` jar file. 
+For example, if using the Hive shell, this can be achieved by issuing a statement like so:
+
+```
 add jar /path/to/iceberg-hive-runtime.jar;
 ```
-There are many others ways to achieve this including adding the jar file to Hive's auxiliary classpath (so it is available by default) - please refer to Hive's documentation for more information.
 
-#### Using Hadoop Tables
-Iceberg tables created using `HadoopTables` are stored entirely in a directory in a filesystem like HDFS.
+There are many other ways to achieve this, including adding the jar file to Hive's auxiliary classpath so that it is available by default.
+Please refer to Hive's documentation for more information.
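+
+For example, the jar can be added to the auxiliary classpath with a `hive-site.xml` entry along these lines (the jar path below is a placeholder):
+
+```xml
+<property>
+  <name>hive.aux.jars.path</name>
+  <value>/path/to/iceberg-hive-runtime.jar</value>
+</property>
+```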
 
-##### Create an Iceberg table
-The first step is to create an Iceberg table using the Spark/Java/Python API and `HadoopTables`. For the purposes of this documentation we will assume that the table is called `table_a` and that the table location is `hdfs://some_path/table_a`.
+### Enabling support
 
-##### Create a Hive table
-Now overlay a Hive table on top of this Iceberg table by issuing Hive DDL like so:
-```sql
-CREATE EXTERNAL TABLE table_a 
-STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
-LOCATION 'hdfs://some_bucket/some_path/table_a';
+If the Iceberg storage handler is not in Hive's classpath, then Hive cannot load or update the metadata for an Iceberg table when the storage handler is set.
+To avoid the appearance of broken tables in Hive, Iceberg will not add the storage handler to a table unless Hive support is enabled.
+The storage handler is kept in sync (added or removed) every time a table is updated.
+There are two ways to enable Hive support: globally in the Hadoop configuration, or per table using a table property.
+
+#### Hadoop configuration
+
+To enable Hive support globally for an application, set `iceberg.engine.hive.enabled=true` in its Hadoop configuration. 
+For example, setting this in the `hive-site.xml` loaded by Spark will enable the storage handler for all tables created by Spark.
+
+!!! Warning
+    When using Tez, you also have to disable vectorization for now (`hive.vectorized.execution.enabled=false`).
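+
+For reference, here is a minimal `hive-site.xml` sketch combining this flag with the Tez note above (illustrative values, not an exhaustive configuration):
+
+```xml
+<property>
+  <name>iceberg.engine.hive.enabled</name>
+  <value>true</value>
+</property>
+<!-- only needed when executing on Tez, as noted above -->
+<property>
+  <name>hive.vectorized.execution.enabled</name>
+  <value>false</value>
+</property>
+```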
+
+#### Table property configuration
+
+Alternatively, the property `engine.hive.enabled` can be set to `true` and added to the table properties when creating the Iceberg table. 
+Here is an example of doing it programmatically:
+
+```java
+Catalog catalog = ...;
+Map<String, String> tableProperties = Maps.newHashMap();
+tableProperties.put(TableProperties.ENGINE_HIVE_ENABLED, "true"); // engine.hive.enabled=true
+catalog.createTable(tableId, schema, spec, tableProperties);
 ```
 
-#### Query the Iceberg table via Hive
-You should now be able to issue Hive SQL `SELECT` queries using the above table and see the results returned from the underlying Iceberg table.
-```sql
-SELECT * from table_a;
+The table-level configuration overrides the global Hadoop configuration.
+
+## Catalog Management
+
+### Global Hive catalog
+
+From the Hive engine's perspective, there is only one global data catalog that is defined in the Hadoop configuration in the runtime environment.
+In contrast, Iceberg supports multiple different data catalog types, such as Hive, Hadoop, AWS Glue, or custom catalog implementations.
+Iceberg also allows loading a table directly by its path in the file system; such tables do not belong to any catalog.
+Users might want to read these cross-catalog and path-based tables through the Hive engine for use cases such as joins.
+
+To support this, a table in the Hive metastore can represent three different ways of loading an Iceberg table,
+depending on the table's `iceberg.catalog` property:
+
+1. The table will be loaded using a `HiveCatalog` that corresponds to the metastore configured in the Hive environment if no `iceberg.catalog` is set
+2. The table will be loaded using a custom catalog if `iceberg.catalog` is set to a catalog name (see below)
+3. The table can be loaded directly using the table's root location if `iceberg.catalog` is set to `location_based_table`
+
+For cases 2 and 3 above, users can create an overlay of an Iceberg table in the Hive metastore,
+so that different table types can work together in the same Hive environment.
+See [CREATE EXTERNAL TABLE](#create-external-table) for more details.
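+
+As a sketch of case 3 above (the table name and location are placeholders), a path-based overlay could be declared like so:
+
+```sql
+CREATE EXTERNAL TABLE table_a
+STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
+LOCATION 'hdfs://some_bucket/some_path/table_a'
+TBLPROPERTIES ('iceberg.catalog'='location_based_table');
+```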
+
+### Custom Iceberg catalogs
+
+To globally register different catalogs, set the following Hadoop configurations:
+
+| Config Key                                    | Description                                            |
+| --------------------------------------------- | ------------------------------------------------------ |
+| iceberg.catalog.<catalog_name\>.type          | type of catalog: `hive`, `hadoop`, or `custom`         |
+| iceberg.catalog.<catalog_name\>.catalog-impl  | catalog implementation, must not be null if type is `custom` |
+| iceberg.catalog.<catalog_name\>.<key\>        | any config key and value pairs for the catalog         |
+
+Here are some examples using Hive CLI:
+
+Register a `HiveCatalog` called `another_hive`:
+
 ```
+SET iceberg.catalog.another_hive.type=hive;
+SET iceberg.catalog.another_hive.uri=thrift://example.com:9083;
+SET iceberg.catalog.another_hive.clients=10;
+SET iceberg.catalog.another_hive.warehouse=hdfs://example.com:8020/warehouse;
+```
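+
+A table overlay can then reference the registered catalog by name through the `iceberg.catalog` table property (a sketch; the table name is a placeholder and the table must already exist in `another_hive`):
+
+```sql
+CREATE EXTERNAL TABLE table_a
+STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
+TBLPROPERTIES ('iceberg.catalog'='another_hive');
+```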
+
+Register a `HadoopCatalog` called `hadoop`:
 
-#### Using Hive Catalog
-Iceberg tables created using `HiveCatalog` are automatically registered with Hive.
+```
+SET iceberg.catalog.hadoop.type=hadoop;
+SET iceberg.catalog.hadoop.warehouse=hdfs://example.com:8020/warehouse;
+```
 
-##### Create an Iceberg table
-The first step is to create an Iceberg table using the Spark/Java/Python API and `HiveCatalog`. For the purposes of this documentation we will assume that the table is called `table_b` and that the table location is `s3://some_path/table_b`. In order for Iceberg to correctly set up the Hive table for querying some configuration values need to be set, the two options for this are described below - you can use either or the other depending on your use case.
+Register an AWS `GlueCatalog` called `glue`:
 
-##### Hive Configuration
-The value `iceberg.engine.hive.enabled` needs to be set to `true` and added to the Hive configuration file on the classpath of the application creating or modifying (altering, inserting etc.) the table. This can be done by modifying the relevant `hive-site.xml`. Alternatively this can be done programmatically like so:
-```java
-Configuration hadoopConfiguration = spark.sparkContext().hadoopConfiguration();
-hadoopConfiguration.set(ConfigProperties.ENGINE_HIVE_ENABLED, "true"); //iceberg.engine.hive.enabled=true
-HiveCatalog catalog = new HiveCatalog(hadoopConfiguration);
-...
-catalog.createTable(tableId, schema, spec);
+```
+SET iceberg.catalog.glue.type=custom;

Review comment:
       How about adding the line `SET iceberg.catalog=glue;` to this custom catalog example? I think people would prefer a full example that makes the demo run, as in this [part](https://github.com/apache/iceberg/pull/2544/files#diff-0270f04c6c1a4be5da895415fff2797103da7ded6ec97c303f2f7e218e99ac26R88).



