You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by si...@apache.org on 2021/08/25 14:55:06 UTC
[hudi] branch asf-site updated: [HUDI-2356] Fixing spark quick
start guide for spark-sql (#3532)
This is an automated email from the ASF dual-hosted git repository.
sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new ed6dba6 [HUDI-2356] Fixing spark quick start guide for spark-sql (#3532)
ed6dba6 is described below
commit ed6dba631f3b341cd1a09ef53e23422e611c7b8c
Author: Sivabalan Narayanan <si...@uber.com>
AuthorDate: Wed Aug 25 10:54:49 2021 -0400
[HUDI-2356] Fixing spark quick start guide for spark-sql (#3532)
---
website/docs/quick-start-guide.md | 151 ++++++++++++++++++++++++++------------
1 file changed, 105 insertions(+), 46 deletions(-)
diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index a6ed3ba..8f60f86 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -144,33 +144,60 @@ can generate sample inserts and updates based on the the sample trip schema [her
## Create Table
-Hudi support create table using spark-sql.
+<Tabs
+defaultValue="scala"
+values={[
+{ label: 'Scala', value: 'scala', },
+{ label: 'Python', value: 'python', },
+{ label: 'SparkSQL', value: 'sparksql', },
+]}>
+<TabItem value="scala">
-- COW & MOR table
-
- We can create a COW table or MOR table by specified the **type** option. **type = 'cow'** represents COW table, while **type = 'mor'** represents MOR table.
+```scala
+# scala
+ // No separate create table command required in spark. First batch of write to a table will create the table if not exists.
+```
+
+</TabItem>
+<TabItem value="python">
+```python
+# pyspark
+ // No separate create table command required in spark. First batch of write to a table will create the table if not exists.
+```
-- Partitioned & Non-Partitioned table
+</TabItem>
+<TabItem value="sparksql">
- We can use the **partitioned by** statement to specified the partition columns to create a partitioned table.
-
-- Managed & External table
+Spark-sql needs an explicit create table command.
- If the table has not specified the **location** statement, it is a managed table. Or else it is an exteranl table.
-
-- Table with primary key
+- Table types:
+ Both types of hudi tables (CopyOnWrite (COW) and MergeOnRead (MOR)) can be created using spark-sql.
- If the table has specified the **primaryKey** column in options, it is a table with primary key.
-
+ While creating the table, table type can be specified using **type** option. **type = 'cow'** represents COW table, while **type = 'mor'** represents MOR table.
+
+- Partitioned & Non-Partitioned table:
+ Users can create a partitioned table or non-partitioned table in spark-sql.
+ To create a partitioned table, one needs to use **partitioned by** statement to specify the partition columns to create a partitioned table.
+ When there is no **partitioned by** statement with create table command, table is considered to be a non-partitioned table.
+
+- Managed & External table:
+ In general, spark-sql supports two kinds of tables, namely managed and external. If one specifies a location using **location** statement, it is an external table, else its considered a managed table. You can read more about external vs managed tables [here](https://sparkbyexamples.com/apache-hive/difference-between-hive-internal-tables-and-external-tables/).
+
+- Table with primary key:
+ Users can choose to create a table with primary key if need be. Else table is considered a non-primary keyed table.
+ If the user has specified the **primaryKey** column in options, table is considered to be a primary key table.
+ If you are using any of the built-in key generators in Hudi, likely its a primary key table.
-**Create Non-Partitioned Table**
+Let's go over some of the create table commands.
-Here is an example of creating a non-partitioned COW managed table with a primary key 'id'.
+**Create a Non-Partitioned Table**
+
+Here is an example of creating a managed non-partitioned COW managed table with a primary key 'id'.
```sql
-- create a managed cow table
-create table if not exists h0(
+create table if not exists hudi_table0 (
id int,
name string,
price double
@@ -181,31 +208,30 @@ options (
);
```
-This is an example of creating an MOR external table which specified the table's location. The **preCombineField** option
+Here is an example of creating an MOR external table (location needs to be specified). The **preCombineField** option
is used to specify the preCombine field for merge.
```sql
-- creae an external mor table
-create table if not exists h1(
+create table if not exists hudi_table1 (
id int,
name string,
price double,
ts bigint
) using hudi
-location '/tmp/hudi/h0'
+location '/tmp/hudi/hudi_table1'
options (
type = 'mor',
primaryKey = 'id,name',
preCombineField = 'ts'
-)
-;
+);
```
Here is the example of creating a COW table without primary key.
```sql
-- create a non-primary key table
-create table if not exists h2(
+create table if not exists hudi_table2(
id int,
name string,
price double
@@ -214,26 +240,33 @@ options (
type = 'cow'
);
```
+
+Here is an example of creating an external COW partitioned table.
+
**Create Partitioned Table**
```sql
-create table if not exists h_p0 (
+create table if not exists hudi_table_p0 (
id bigint,
name string,
dt stringļ¼
hh string
) using hudi
-location '/tmp/hudi/h_p0'
+location '/tmp/hudi/hudi_table_p0'
options (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts'
)
-partitioned by (dt, hh)
-;
+partitioned by (dt, hh);
```
-**Create Table On The Exists Table Path**
-We can create a table on an exists hudi table path. This is useful to read/write from a pre-exists hudi table by spark-sql.
+If you wish to use spark-sql for an already existing hudi table (created pre 0.9.0), it is possible with the below command.
+
+**Create Table for an existing Hudi Table**
+
+We can create a table on an existing hudi table(created with spark-shell or deltastreamer). This is useful to
+read/write to/from a pre-existing hudi table.
+
```sql
create table h_p1 using hudi
options (
@@ -241,38 +274,62 @@ We can create a table on an exists hudi table path. This is useful to read/write
preCombineField = 'ts'
)
partitioned by (dt)
- location '/path/to/hudi'
+ location '/path/to/hudi';
```
**CTAS**
-Hudi support CTAS on spark sql. For better performance to loading data to hudi table, CTAS use the **bulk insert**
-write operation.
+Hudi supports CTAS(Create table as select) on spark sql. <br/>
+Note: For better performance to load data to hudi table, CTAS uses the **bulk insert** as the write operation.
+
+Example CTAS command to create a partitioned, primary key COW table.
+
```sql
--- create a partitioned COW table
create table h2 using hudi
options (type = 'cow', primaryKey = 'id')
partitioned by (dt)
as
-select 1 as id, 'a1' as name, 10 as price, 1000 as dt
-;
+select 1 as id, 'a1' as name, 10 as price, 1000 as dt;
+```
+
+Example CTAS command to create a non-partitioned COW table.
--- create a non-partitioned COW table.
+```sql
create table h3 using hudi
as
-select 1 as id, 'a1' as name, 10 as price
-;
+select 1 as id, 'a1' as name, 10 as price;
+```
+
+Example CTAS command to load data from another table.
+
+```sql
+# create managed parquet table
+create table parquet_mngd using parquet location 'file:///tmp/parquet_dataset/*.parquet';
+
+# CTAS by loading data into hudi table
+create table hudi_tbl using hudi location 'file:/tmp/hudi/hudi_tbl/' options (
+ type = 'cow',
+ primaryKey = 'id',
+ preCombineField = 'ts'
+ )
+partitioned by (datestr) as select * from parquet_mngd;
```
**Create Table Options**
+Users can set table options while creating a hudi table. Critical options are listed here.
+
| Parameter Name | Introduction |
|------------|--------|
| primaryKey | The primary key names of the table, multiple fields separated by commas. |
| type | The table type to create. type = 'cow' means a COPY-ON-WRITE table,while type = 'mor' means a MERGE-ON-READ table. Default value is 'cow' without specified this option.|
| preCombineField | The Pre-Combine field of the table. |
-To set custom hudi config, see the "Set hudi config section" would help.
+To set any custom hudi config(like index type, max parquet size, etc), see the "Set hudi config section" .
+
+</TabItem>
+</Tabs>
+
## Insert data
@@ -941,10 +998,10 @@ e.g.
```sql
insert overwrite table h_p1 select 2 as id, 'a2', '2021-01-03' as dt, '19' as hh;
```
-## Other Spark Sql Command
+## More Spark Sql Commands
### AlterTable
-**Syntx**
+**Syntax**
```sql
-- Alter table name
ALTER TABLE oldTableName RENAME TO newTableName
@@ -955,18 +1012,18 @@ ALTER TABLE tableIdentifier ADD COLUMNS(colAndType (,colAndType)*)
-- Alter table column type
ALTER TABLE tableIdentifier CHANGE COLUMN colName colName colType
```
-**Case**
+**Examples**
```sql
alter table h0 rename to h0_1;
alter table h0_1 add columns(ext0 string);
-alter tablee h0_1 change column id id bigint;
+alter table h0_1 change column id id bigint;
```
-## Set hudi config
+## Setting custom hudi configs
### Use set command
-You can use the **set** command to set the hudi's config, which will work for the
+You can use the **set** command to set any custom hudi's config, which will work for the
whole spark session scope.
```sql
set hoodie.insert.shuffle.parallelism = 100;
@@ -975,7 +1032,7 @@ set hoodie.delete.shuffle.parallelism = 100;
```
### Set with table options
-You can also set the config in the table's options when creating table which will work for
+You can also set the config with table options when creating table which will work for
the table scope only and override the config set by the SET command.
```sql
create table if not exists h3(
@@ -990,6 +1047,7 @@ options (
${hoodie.config.key2} = '${hoodie.config.value2}',
....
);
+
e.g.
create table if not exists h3(
id bigint,
@@ -1003,7 +1061,8 @@ options (
hoodie.keep.max.commits = '20'
);
```
-You can alter the config in the table options by the **ALTER SERDEPROPERTIES**
+
+You can also alter the write config for a table by the **ALTER SERDEPROPERTIES**
e.g.
```sql