Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/09 01:39:41 UTC

[GitHub] [iceberg] rdsr commented on a change in pull request #1172: Docs: Update spark.md for Spark 3

rdsr commented on a change in pull request #1172:
URL: https://github.com/apache/iceberg/pull/1172#discussion_r451772596



##########
File path: site/docs/spark.md
##########
@@ -19,42 +19,272 @@
 
 Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. Spark DSv2 is an evolving API with different levels of support in Spark versions.
 
-| Feature support                              | Spark 2.4 | Spark 3.0 (unreleased) | Notes                                          |
-|----------------------------------------------|-----------|------------------------|------------------------------------------------|
-| [DataFrame reads](#reading-an-iceberg-table) | ✔️        | ✔️                     |                                                |
-| [DataFrame append](#appending-data)          | ✔️        | ✔️                     |                                                |
-| [DataFrame overwrite](#overwriting-data)     | ✔️        | ✔️                     | Overwrite mode replaces partitions dynamically |
-| [Metadata tables](#inspecting-tables)        | ✔️        | ✔️                     |                                                |
-| SQL create table                             |           | ✔️                     |                                                |
-| SQL alter table                              |           | ✔️                     |                                                |
-| SQL drop table                               |           | ✔️                     |                                                |
-| SQL select                                   |           | ✔️                     |                                                |
-| SQL create table as                          |           | ✔️                     |                                                |
-| SQL replace table as                         |           | ✔️                     |                                                |
-| SQL insert into                              |           | ✔️                     |                                                |
-| SQL insert overwrite                         |           | ✔️                     |                                                |
+| Feature support                                  | Spark 3.0| Spark 2.4  | Notes                                          |
+|--------------------------------------------------|----------|------------|------------------------------------------------|
+| [SQL create table](#create-table)                | ✔️        |            |                                                |
+| [SQL create table as](#create-table-as-select)   | ✔️        |            |                                                |
+| [SQL replace table as](#replace-table-as-select) | ✔️        |            |                                                |
+| [SQL alter table](#alter-table)                  | ✔️        |            |                                                |
+| [SQL drop table](#drop-table)                    | ✔️        |            |                                                |
+| [SQL select](#querying-with-sql)                 | ✔️        |            |                                                |
+| [SQL insert into](#insert-into)                  | ✔️        |            |                                                |
+| [SQL insert overwrite](#insert-overwrite)        | ✔️        |            |                                                |
+| [DataFrame reads](#querying-with-dataframes)     | ✔️        | ✔️          |                                                |
+| [DataFrame append](#appending-data)              | ✔️        | ✔️          |                                                |
+| [DataFrame overwrite](#overwriting-data)         | ✔️        | ✔️          | ⚠ Behavior changed in Spark 3.0                |
+| [DataFrame CTAS and RTAS](#creating-tables)      | ✔️        |            |                                                |
+| [Metadata tables](#inspecting-tables)            | ✔️        | ✔️          |                                                |
+
+## Configuring catalogs
+
+Spark 3.0 adds an API to plug in table catalogs that are used to load, create, and manage Iceberg tables. Spark catalogs are configured by setting Spark properties under `spark.sql.catalog`.
+
+This creates an Iceberg catalog named `hive_prod` that loads tables from a Hive metastore:
+
+```plain
+spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.hive_prod.type = hive
+spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
+```
+
+Iceberg also supports a directory-based catalog in HDFS that can be configured using `type=hadoop`:
+
+```plain
+spark.sql.catalog.hadoop_prod = org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.hadoop_prod.type = hive
+spark.sql.catalog.hadoop_prod.warehouse = hdfs://nn:8020/warehouse/path
+```
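+
+Either catalog's properties can also be passed as `--conf` flags on the command line instead of being set in `spark-defaults.conf` (a sketch, using the `hive_prod` catalog above):
+
+```plain
+spark-shell --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
+            --conf spark.sql.catalog.hive_prod.type=hive \
+            --conf spark.sql.catalog.hive_prod.uri=thrift://metastore-host:port
+```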
+
+!!! Note
+    The Hive-based catalog only loads Iceberg tables. To load non-Iceberg tables in the same Hive metastore, use a [session catalog](#replacing-the-session-catalog).
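+
+    A minimal sketch of that replacement, assuming the `SparkSessionCatalog` implementation described in the linked section:
+
+    ```plain
+    spark.sql.catalog.spark_catalog = org.apache.iceberg.spark.SparkSessionCatalog
+    spark.sql.catalog.spark_catalog.type = hive
+    ```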
+
+### Using catalogs
+
+Catalog names are used in SQL queries to identify a table. In the examples above, `hive_prod` and `hadoop_prod` can be used to prefix database and table names that will be loaded from those catalogs.
+
+```sql
+SELECT * FROM hive_prod.db.table -- load db.table from catalog hive_prod
+```
+
+Spark 3 keeps track of a current catalog and namespace, which can be omitted from table names.

Review comment:
       nit: should this be "... track of *the* current catalog..."?

##########
File path: site/docs/spark.md
##########
@@ -19,42 +19,272 @@
 
 Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. Spark DSv2 is an evolving API with different levels of support in Spark versions.
 
-| Feature support                              | Spark 2.4 | Spark 3.0 (unreleased) | Notes                                          |
-|----------------------------------------------|-----------|------------------------|------------------------------------------------|
-| [DataFrame reads](#reading-an-iceberg-table) | ✔️        | ✔️                     |                                                |
-| [DataFrame append](#appending-data)          | ✔️        | ✔️                     |                                                |
-| [DataFrame overwrite](#overwriting-data)     | ✔️        | ✔️                     | Overwrite mode replaces partitions dynamically |
-| [Metadata tables](#inspecting-tables)        | ✔️        | ✔️                     |                                                |
-| SQL create table                             |           | ✔️                     |                                                |
-| SQL alter table                              |           | ✔️                     |                                                |
-| SQL drop table                               |           | ✔️                     |                                                |
-| SQL select                                   |           | ✔️                     |                                                |
-| SQL create table as                          |           | ✔️                     |                                                |
-| SQL replace table as                         |           | ✔️                     |                                                |
-| SQL insert into                              |           | ✔️                     |                                                |
-| SQL insert overwrite                         |           | ✔️                     |                                                |
+| Feature support                                  | Spark 3.0| Spark 2.4  | Notes                                          |
+|--------------------------------------------------|----------|------------|------------------------------------------------|
+| [SQL create table](#create-table)                | ✔️        |            |                                                |
+| [SQL create table as](#create-table-as-select)   | ✔️        |            |                                                |
+| [SQL replace table as](#replace-table-as-select) | ✔️        |            |                                                |
+| [SQL alter table](#alter-table)                  | ✔️        |            |                                                |
+| [SQL drop table](#drop-table)                    | ✔️        |            |                                                |
+| [SQL select](#querying-with-sql)                 | ✔️        |            |                                                |
+| [SQL insert into](#insert-into)                  | ✔️        |            |                                                |
+| [SQL insert overwrite](#insert-overwrite)        | ✔️        |            |                                                |
+| [DataFrame reads](#querying-with-dataframes)     | ✔️        | ✔️          |                                                |
+| [DataFrame append](#appending-data)              | ✔️        | ✔️          |                                                |
+| [DataFrame overwrite](#overwriting-data)         | ✔️        | ✔️          | ⚠ Behavior changed in Spark 3.0                |
+| [DataFrame CTAS and RTAS](#creating-tables)      | ✔️        |            |                                                |
+| [Metadata tables](#inspecting-tables)            | ✔️        | ✔️          |                                                |
+
+## Configuring catalogs
+
+Spark 3.0 adds an API to plug in table catalogs that are used to load, create, and manage Iceberg tables. Spark catalogs are configured by setting Spark properties under `spark.sql.catalog`.
+
+This creates an Iceberg catalog named `hive_prod` that loads tables from a Hive metastore:
+
+```plain
+spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.hive_prod.type = hive
+spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
+```
+
+Iceberg also supports a directory-based catalog in HDFS that can be configured using `type=hadoop`:
+
+```plain
+spark.sql.catalog.hadoop_prod = org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.hadoop_prod.type = hive

Review comment:
       Did you mean `spark.sql.catalog.hadoop_prod.type = hadoop`?

##########
File path: site/docs/spark.md
##########
@@ -91,50 +332,188 @@ spark.sql("""select count(1) from table""").show()
 ```
 
 
+## Writing with SQL
+
+Spark 3 supports SQL `INSERT INTO` and `INSERT OVERWRITE`, as well as the new `DataFrameWriterV2` API.
+
+### `INSERT INTO`
+
+To append new data to a table, use `INSERT INTO`.
+
+```sql
+INSERT INTO prod.db.table VALUES (1, 'a'), (2, 'b')
+```
+```sql
+INSERT INTO prod.db.table SELECT ...
+```
+
+### `INSERT OVERWRITE`
+
+To replace data in the table with the result of a query, use `INSERT OVERWRITE`. Overwrites are atomic operations for Iceberg tables.
+
+The partitions that will be replaced by `INSERT OVERWRITE` depend on Spark's partition overwrite mode and the partitioning of the table.
+
+#### Overwrite behavior
+
+Spark's default overwrite mode is **static**, but **dynamic overwrite mode is recommended when writing to Iceberg tables.** Static overwrite mode determines which partitions to overwrite in a table by converting the `PARTITION` clause to a filter, but the `PARTITION` clause can only reference table columns.
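+
+For example, a static overwrite of a hypothetical `db.events` table partitioned by `level` is sketched below; the `PARTITION` clause is converted to the filter `level = 'INFO'`, and the static partition column is omitted from the `SELECT` list:
+
+```sql
+INSERT OVERWRITE db.events
+PARTITION (level = 'INFO')
+SELECT id, ts, message
+FROM db.new_events
+```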
+
+Dynamic overwrite mode is configured by setting `spark.sql.sources.partitionOverwriteMode=dynamic`.
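+
+For example, the mode can be switched per session in SQL (a sketch):
+
+```sql
+SET spark.sql.sources.partitionOverwriteMode = dynamic
+```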
+
+To demonstrate the behavior of dynamic and static overwrites, consider a `logs` table defined by the following DDL:
+
+```sql
+CREATE TABLE prod.my_app.logs (
+    uuid string NOT NULL,
+    level string NOT NULL,
+    ts timestamp NOT NULL,
+    message string)
+USING iceberg
+PARTITIONED BY (level, hours(ts))
+```
+
+#### Dynamic overwrite
+
+When Spark's overwrite mode is dynamic, partitions that have rows produced by the `SELECT` query will be replaced.
+
+For example, this query removes duplicate log events from the example `logs` table.
+
+```sql
+INSERT OVERWRITE prod.my_app.logs
+SELECT uuid, first(level), first(ts), first(message)
+FROM prod.my_app.logs
+WHERE cast(ts as date) = '2020-07-01'
+GROUP BY uuid
+```
+
+In dynamic mode, this will replace any partition with rows in the `SELECT` result. Because the date of all rows is restricted 1 July, only hours of that day will be replaced.

Review comment:
       nit: `restricted to 1st July`



