Posted to commits@hudi.apache.org by xu...@apache.org on 2022/04/27 02:22:55 UTC

[hudi] branch asf-site updated: [HUDI-3925][DOCS] add schema evolution for sparksql (#5238)

This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 86c0529f31 [HUDI-3925][DOCS] add schema evolution for sparksql (#5238)
86c0529f31 is described below

commit 86c0529f313f6b8be1e26f3323f7be328d8151c7
Author: xiarixiaoyao <me...@qq.com>
AuthorDate: Wed Apr 27 10:22:50 2022 +0800

    [HUDI-3925][DOCS] add schema evolution for sparksql (#5238)
---
 website/docs/quick-start-guide.md | 172 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 172 insertions(+)

diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index 51d0b838f7..a6df0de985 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -1113,6 +1113,178 @@ Call command has already support some commit procedures and table optimization p
 more details please refer to [procedures](/docs/next/procedures).
 
 
+## Schema evolution
+Schema evolution allows users to easily change the current schema of a Hudi table to adapt to the data that is changing over time.
+As of the 0.11.0 release, Spark SQL (Spark 3.1.x and Spark 3.2.1) DDL support for schema evolution has been added and is experimental.
+
+### Schema Evolution Scenarios
+1) Columns (including nested columns) can be added, deleted, modified, and moved.
+2) Partition columns cannot be evolved.
+3) You cannot add, delete, or perform operations on nested columns of the Array type.
+
+## SparkSQL Schema Evolution and Syntax Description
+Before using schema evolution, set `spark.sql.extensions`. For Spark 3.2.1, `spark.sql.catalog.spark_catalog` also needs to be set.
+```shell
+# Spark SQL for spark 3.1.x
+spark-sql --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.12:3.1.2 \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
+
+# Spark SQL for spark 3.2.1
+spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.12:3.2.1 \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
+--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
+
+```
+After starting the Spark application, run `set schema.on.read.enable=true` to enable schema evolution.
+
+:::note
+Currently, schema evolution cannot be disabled once it has been enabled.
+
+
+:::
+
+### Adding Columns
+**Syntax**
+```sql
+-- add columns
+ALTER TABLE tableName ADD COLUMNS(col_spec[, col_spec ...])
+```
+**Parameter Description**
+
+| Parameter       | Description                  |
+|-----------------|------------------------------|
+| tableName       | Table name                   |
+| col_spec        | Column specifications, consisting of five fields, *col_name*, *col_type*, *nullable*, *comment*, and *col_position*.|
+
+**col_name** : name of the new column. It is mandatory. To add a sub-column to a nested column, specify the full name of the sub-column in this field.
+
+For example:
+
+1. To add sub-column `col1` to a nested struct-type column `users struct<name: string, age: int>`, set this field to `users.col1`.
+
+2. To add sub-column `col1` to a nested map-type column `member map<string, struct<n: string, a: int>>`, set this field to `member.value.col1`.
+
+**col_type** : type of the new column.
+
+**nullable** : whether the new column can be null. The value can be left empty. This field is currently not used in Hudi.
+
+**comment** : comment of the new column. The value can be left empty.
+
+**col_position** : position where the new column is added. The value can be *FIRST* or *AFTER* origin_col.
+
+1. If it is set to *FIRST*, the new column will be added as the first column of the table.
+
+2. If it is set to *AFTER* origin_col, the new column will be added after the original column origin_col.
+
+3. The value can be left empty. *FIRST* can be used only when new sub-columns are added to nested columns. Do not use *FIRST* for top-level columns. There are no restrictions on the usage of *AFTER*.
+
+**Examples**
+
+```sql
+alter table h0 add columns(ext0 string);
+alter table h0 add columns(new_col int not null comment 'add new column' after col1);
+alter table complex_table add columns(col_struct.col_name string comment 'add new column to a struct col' after col_from_col_struct);
+```
+
+### Altering Columns
+**Syntax**
+```sql
+-- alter table ... alter column
+ALTER TABLE tableName ALTER [COLUMN] col_old_name TYPE column_type [COMMENT] col_comment [FIRST|AFTER] column_name
+```
+
+**Parameter Description**
+
+| Parameter       | Description                  |
+|-----------------|------------------------------|
+| tableName      | Table name.                   |
+| col_old_name   | Name of the column to be altered.|
+| column_type    | Type of the target column.|
+| col_comment    | Comment of the column.|
+| column_name    | New position to place the target column. For example, *AFTER* **column_name** indicates that the target column is placed after **column_name**.|
+
+
+**Examples**
+
+```sql
+--- Changing the column type
+ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint
+
+--- Altering other attributes
+ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
+ALTER TABLE table1 ALTER COLUMN a.b.c FIRST
+ALTER TABLE table1 ALTER COLUMN a.b.c AFTER x
+ALTER TABLE table1 ALTER COLUMN a.b.c DROP NOT NULL
+```
+
+**Column Type Change**
+
+| old_type        | new_type                        |
+|-----------------|---------------------------------|
+| int             | long/float/double/string/decimal|
+| long            | double/string/decimal           |
+| float           | double/string/decimal           |
+| double          | string/decimal                  |
+| decimal         | decimal/string                  |
+| string          | decimal/date                    |
+| date            | string                          |
+
+### Deleting Columns
+**Syntax**
+```sql
+-- alter table ... drop columns
+ALTER TABLE tableName DROP COLUMN|COLUMNS cols
+```
+
+**Examples**
+
+```sql
+ALTER TABLE table1 DROP COLUMN a.b.c
+ALTER TABLE table1 DROP COLUMNS a.b.c, x, y
+```
+
+### Changing Column Name
+**Syntax**
+```sql
+-- alter table ... rename column
+ALTER TABLE tableName RENAME COLUMN old_columnName TO new_columnName
+```
+
+**Examples**
+
+```sql
+ALTER TABLE table1 RENAME COLUMN a.b.c TO x
+```
+
+### Modifying Table Properties
+**Syntax**
+```sql
+-- alter table ... set|unset
+ALTER TABLE tableName SET|UNSET tblproperties
+```
+
+**Examples**
+
+```sql
+ALTER TABLE table SET TBLPROPERTIES ('table_property' = 'property_value')
+ALTER TABLE table UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
+```
+
+### Changing a Table Name
+**Syntax**
+```sql
+-- alter table ... rename
+ALTER TABLE tableName RENAME TO newTableName
+```
+
+**Examples**
+
+```sql
+ALTER TABLE table1 RENAME TO table2
+```
+
 ## Where to go from here?
 
 You can also do the quickstart by [building hudi yourself](https://github.com/apache/hudi#building-apache-hudi-from-source),