Posted to commits@hudi.apache.org by si...@apache.org on 2021/08/31 13:44:57 UTC
[hudi] branch asf-site updated: [HUDI-2381] Fixing quick start
guide (#3570)
This is an automated email from the ASF dual-hosted git repository.
sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new bf359ce [HUDI-2381] Fixing quick start guide (#3570)
bf359ce is described below
commit bf359cefc237b59fb18615bc36cd9490a2c56710
Author: Sivabalan Narayanan <si...@uber.com>
AuthorDate: Tue Aug 31 09:44:47 2021 -0400
[HUDI-2381] Fixing quick start guide (#3570)
---
website/docs/quick-start-guide.md | 175 ++++++++++++--------
.../version-0.9.0/quick-start-guide.md | 182 ++++++++++++---------
2 files changed, 212 insertions(+), 145 deletions(-)
diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index 36cd6ea..1a58297 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -14,8 +14,8 @@ After each write operation we will also show how to read the data both snapshot
## Setup
-Hudi works with Spark-2.4.3+ & Spark 3.x versions. You can follow instructions [here](https://spark.apache.org/downloads) for setting up spark.
-From the extracted directory run spark-shell with Hudi as:
+Hudi works with Spark-2.4.3+ & Spark 3.x versions. You can follow instructions [here](https://spark.apache.org/downloads) for setting up spark.
+With the 0.9.0 release, spark-sql DML support has been added and is experimental.
<Tabs
defaultValue="scala"
@@ -26,6 +26,8 @@ values={[
]}>
<TabItem value="scala">
+From the extracted directory run spark-shell with Hudi as:
+
```scala
// spark-shell for spark 3
spark-shell \
@@ -47,6 +49,8 @@ spark-shell \
<TabItem value="sparksql">
Hudi support using spark sql to write and read data with the **HoodieSparkSessionExtension** sql extension.
+From the extracted directory run spark-sql with Hudi as:
+
```shell
# spark sql for spark 3
spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
@@ -68,6 +72,8 @@ spark-sql \
</TabItem>
<TabItem value="python">
+From the extracted directory run pyspark with Hudi as:
+
```python
# pyspark
export PYSPARK_PYTHON=$(which python3)
@@ -185,9 +191,9 @@ Spark-sql needs an explicit create table command.
In general, spark-sql supports two kinds of tables, namely managed and external. If one specifies a location using **location** statement, it is an external table, else its considered a managed table. You can read more about external vs managed tables [here](https://sparkbyexamples.com/apache-hive/difference-between-hive-internal-tables-and-external-tables/).
- Table with primary key:
- Users can choose to create a table with primary key if need be. Else table is considered a non-primary keyed table.
- If the user has specified the **primaryKey** column in options, table is considered to be a primary key table.
- If you are using any of the built-in key generators in Hudi, likely its a primary key table.
+  Users can choose to create a table with a primary key if needed. Otherwise, the table is considered a non-primary-keyed table.
+  One needs to set the **primaryKey** column in options to create a primary key table.
+  If you are using any of the built-in key generators in Hudi, it is likely a primary key table.
Let's go over some of the create table commands.
@@ -212,7 +218,7 @@ Here is an example of creating an MOR external table (location needs to be speci
is used to specify the preCombine field for merge.
```sql
--- creae an external mor table
+-- create an external mor table
create table if not exists hudi_table1 (
id int,
name string,
@@ -227,7 +233,7 @@ options (
);
```
-Here is the example of creating a COW table without primary key.
+Here is an example of creating a COW table without primary key.
```sql
-- create a non-primary key table
@@ -333,8 +339,6 @@ To set any custom hudi config(like index type, max parquet size, etc), see the
## Insert data
-Generate some new trips, load them into a DataFrame and write the DataFrame into the Hudi table as below.
-
<Tabs
defaultValue="scala"
values={[
@@ -344,6 +348,8 @@ values={[
]}>
<TabItem value="scala">
+Generate some new trips, load them into a DataFrame and write the DataFrame into the Hudi table as below.
+
```scala
// spark-shell
val inserts = convertToStringList(dataGen.generateInserts(10))
@@ -356,11 +362,21 @@ df.write.format("hudi").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
-```
-
+```
+:::info
+`mode(Overwrite)` overwrites and recreates the table if it already exists.
+You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key
+(`uuid` in [schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)), partition field (`region/country/city`) and combine logic (`ts` in
+[schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)) to ensure trip records are unique within each partition. For more info, refer to
+[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
+and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
+Here we are using the default write operation: `upsert`. If you have a workload without updates, you can also issue
+`insert` or `bulk_insert` operations, which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations).
+:::
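The partition layout described in the note above (`region/country/city` under the base path) can be sketched as a small path builder. This is an illustrative helper, not part of Hudi or QuickstartUtils:

```python
import os

def partition_path(base_path, record):
    """Build the region/country/city partition directory for a trip record."""
    return os.path.join(base_path, record["region"], record["country"], record["city"])

p = partition_path("/tmp/hudi_trips_cow",
                   {"region": "americas", "country": "brazil", "city": "sao_paulo"})
# p == "/tmp/hudi_trips_cow/americas/brazil/sao_paulo"
```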
</TabItem>
<TabItem value="python">
+Generate some new trips, load them into a DataFrame and write the DataFrame into the Hudi table as below.
```python
# pyspark
@@ -383,7 +399,16 @@ df.write.format("hudi").
mode("overwrite").
save(basePath)
```
-
+:::info
+`mode("overwrite")` overwrites and recreates the table if it already exists.
+You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key
+(`uuid` in [schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)), partition field (`region/country/city`) and combine logic (`ts` in
+[schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)) to ensure trip records are unique within each partition. For more info, refer to
+[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
+and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
+Here we are using the default write operation: `upsert`. If you have a workload without updates, you can also issue
+`insert` or `bulk_insert` operations, which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations).
+:::
</TabItem>
<TabItem value="sparksql">
@@ -406,45 +431,25 @@ insert overwrite table h0 select 1, 'a1', 20;
-- insert overwrite table with static partition
insert overwrite h_p0 partition(dt = '2021-01-02') select 1, 'a1';
-- insert overwrite table with dynamic partition
insert overwrite table h_p1 select 2 as id, 'a2', '2021-01-03' as dt, '19' as hh;
```
**NOTICE**
-1. Insert mode
-
-Hudi support three insert modes when inserting data to a table with primary key(we call it pk-table as followed):
-- upsert <br/>
- This it the default insert mode. For upsert mode, insert statement do the upsert operation for the pk-table which will
- update the duplicate record
-- strict <br/>
- For strict mode, insert statement will keep the primary key uniqueness constraint for COW table which do not allow duplicate record.
- If inserting a record which the primary key is already exists to the table, a HoodieDuplicateKeyException will throw out
- for COW table. For MOR table, it has the same behavior with "upsert" mode.
+- Insert mode: Hudi supports two insert modes when inserting data into a table with a primary key (referred to as a pk-table below):<br/>
+  Using `strict` mode, the insert statement keeps the primary key uniqueness constraint for COW tables, which do not allow
+  duplicate records. If a record with the same primary key already exists during insert, a HoodieDuplicateKeyException is thrown
+  for COW tables. For MOR tables, updates to existing records are allowed.<br/>
+  Using `non-strict` mode, hudi uses the same code path as the `insert` operation in the spark data source for the pk-table.<br/>
+  One can set the insert mode using the config **hoodie.sql.insert.mode**.
-- non-strict <br/>
- For non-strict mode, hudi just do the insert operation for the pk-table.
-
- We can set the insert mode by using the config: **hoodie.sql.insert.mode**
-
-2. Bulk Insert <br/>
- By default, hudi uses the normal insert operation for insert statements. We can set **hoodie.sql.bulk.insert.enable**
+- Bulk Insert: By default, hudi uses the normal insert operation for insert statements. Users can set **hoodie.sql.bulk.insert.enable**
to true to enable the bulk insert for insert statement.
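The `strict` vs `non-strict` behavior described in the notice can be sketched with a toy in-memory model. The `sql_insert` helper and its dict-based bookkeeping are hypothetical, purely to illustrate when a duplicate key raises versus updates; Hudi's actual write path is far more involved:

```python
class HoodieDuplicateKeyException(Exception):
    """Toy stand-in for Hudi's duplicate-key error (illustration only)."""

def sql_insert(table, record, pk="id", mode="strict", table_type="cow"):
    """Sketch of hoodie.sql.insert.mode semantics on a pk-table.

    strict + COW: a duplicate primary key raises.
    strict + MOR: the existing record is updated.
    non-strict:   the plain insert path, modeled here as last-write-wins.
    """
    key = record[pk]
    if mode == "strict" and table_type == "cow" and key in table:
        raise HoodieDuplicateKeyException(f"duplicate key: {key}")
    table[key] = record  # upsert/insert the record under its primary key
    return table

t = {}
sql_insert(t, {"id": 1, "price": 10})
try:
    sql_insert(t, {"id": 1, "price": 20})                 # strict COW: raises
except HoodieDuplicateKeyException:
    pass
sql_insert(t, {"id": 1, "price": 20}, table_type="mor")   # strict MOR: updates
```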
</TabItem>
</Tabs>
-:::info
-`mode(Overwrite)` overwrites and recreates the table if it already exists.
-You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key
-(`uuid` in [schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)), partition field (`region/country/city`) and combine logic (`ts` in
-[schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)) to ensure trip records are unique within each partition. For more info, refer to
-[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
-and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
-Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
-:::
Checkout https://hudi.apache.org/blog/2021/02/13/hudi-key-generators for various key generator options, like Timestamp based,
complex, custom, NonPartitioned Key gen, etc.
@@ -477,7 +482,7 @@ spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_pat
### Time Travel Query
-Hudi support time travel query since 0.9.0. Currently three query time format are supported:
+Hudi supports time travel queries since 0.9.0. Currently, three query time formats are supported, as shown below.
```scala
spark.read.
format("hudi").
@@ -497,7 +502,14 @@ spark.read.
```
-
+:::info
+Since 0.9.0, hudi supports a built-in FileIndex, **HoodieFileIndex**, to query hudi tables,
+which supports partition pruning and the metadata table for queries. This helps improve query performance.
+It also supports a non-global query path, which means users can query the table by the base path without
+specifying the "*" in the query path. This feature is enabled by default for the non-global query path.
+For the global query path, hudi uses the old query path.
+Refer to [Table types and queries](/docs/concepts#table-types--queries) for more info on all table types and query types supported.
+:::
</TabItem>
<TabItem value="sparksql">
@@ -522,18 +534,39 @@ spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
```
-</TabItem>
-</Tabs>
+### Time Travel Query
+
+Hudi supports time travel queries since 0.9.0. Currently, three query time formats are supported, as shown below.
+```python
+#pyspark
+spark.read. \
+ format("hudi"). \
+ option("as.of.instant", "20210728141108"). \
+ load(basePath)
+
+spark.read. \
+ format("hudi"). \
+  option("as.of.instant", "2021-07-28 14:11:08"). \
+ load(basePath)
+
+# It is equal to "as.of.instant = 2021-07-28 00:00:00"
+spark.read. \
+ format("hudi"). \
+ option("as.of.instant", "2021-07-28"). \
+ load(basePath)
+```
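The three accepted `as.of.instant` formats shown above can be normalized with plain `datetime` parsing. This parser is illustrative only (Hudi does its own instant parsing); it shows how a date-only value resolves to midnight:

```python
from datetime import datetime

def parse_as_of_instant(value):
    """Illustrative parser for the three accepted time-travel formats."""
    for fmt in ("%Y%m%d%H%M%S",       # 20210728141108
                "%Y-%m-%d %H:%M:%S",  # 2021-07-28 14:11:08
                "%Y-%m-%d"):          # 2021-07-28 -> midnight
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unsupported as.of.instant: {value}")

# A date-only instant is equivalent to 00:00:00 on that day.
assert parse_as_of_instant("2021-07-28") == datetime(2021, 7, 28)
```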
:::info
Since 0.9.0 hudi has support a hudi built-in FileIndex: **HoodieFileIndex** to query hudi table,
-which has support partition prune and metatable for query. This will help improve query performance.
+which supports partition pruning and metatable for query. This will help improve query performance.
It also supports non-global query path which means users can query the table by the base path without
-specify the "*" in the query path.
-This feature has enabled by default for the non-global query path. For the global query path, we will
-rollback to the old query way.
+specifying the "*" in the query path. This feature is enabled by default for the non-global query path.
+For the global query path, hudi uses the old query path.
Refer to [Table types and queries](/docs/concepts#table-types--queries) for more info on all table types and query types supported.
:::
+</TabItem>
+</Tabs>
+
## Update data
@@ -562,7 +595,11 @@ df.write.format("hudi").
mode(Append).
save(basePath)
```
-
+:::note
+Notice that the save mode is now `Append`. In general, always use append mode unless you are trying to create the table for the first time.
+[Querying](#query-data) the data again will now show updated trips. Each write operation generates a new [commit](/docs/concepts)
+denoted by the timestamp. Look for changes in the `_hoodie_commit_time`, `rider`, and `driver` fields for the same `_hoodie_record_key`s compared to the previous commit.
+:::
</TabItem>
<TabItem value="sparksql">
@@ -612,18 +649,17 @@ when not matched then insert (id,name,price) values(id, name, price)
```
**Notice**
-1.The merge-on condition can be only on primary keys. Support to merge based on other fields will be added in future.
-2. Support for partial updates for Merge-On-Read table will be added in future.
+- The merge-on condition can only be on primary keys. Support for merging based on other fields will be added in the future.
+- Partial updates are supported for COW tables.
e.g.
```sql
merge into h0 using s0
on h0.id = s0.id
when matched then update set price = s0.price * 2
```
-This works well for Cow-On-Write table which support update only the **price** field.
-For Merge-ON-READ table this will be supported in the future.
-
-3、Target table's fields cannot be the right-value of the update expression for Merge-On-Read table.
+This works well for Copy-On-Write tables, which support updating only the **price** field.
+For Merge-On-Read tables this will be supported in the future.
+- A target table's fields cannot be the right-hand value of the update expression for Merge-On-Read tables.
e.g.
```sql
merge into h0 using s0
@@ -632,7 +668,7 @@ e.g.
name = h0.name,
price = s0.price + h0.price
```
-This can work well for Cow-On-Write table, for Merge-ON-READ table this will be supported in the future.
+This works well for Copy-On-Write tables, but is not yet supported for Merge-On-Read tables.
### Update
**Syntax**
@@ -657,15 +693,15 @@ df.write.format("hudi"). \
mode("append"). \
save(basePath)
```
+:::note
+Notice that the save mode is now `Append`. In general, always use append mode unless you are trying to create the table for the first time.
+[Querying](#query-data) the data again will now show updated trips. Each write operation generates a new [commit](/docs/concepts)
+denoted by the timestamp. Look for changes in the `_hoodie_commit_time`, `rider`, and `driver` fields for the same `_hoodie_record_key`s compared to the previous commit.
+:::
</TabItem>
</Tabs>
-:::note
-Notice that the save mode is now `Append`. In general, always use append mode unless you are trying to create the table for the first time.
-[Querying](#query-data) the data again will now show updated trips. Each write operation generates a new [commit](/docs/concepts)
-denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, `driver` fields for the same `_hoodie_record_key`s in previous commit.
-:::
## Incremental query
@@ -795,7 +831,6 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hud
</Tabs>
## Delete data {#deletes}
-Delete records for the HoodieKeys passed in.
<Tabs
defaultValue="scala"
@@ -805,6 +840,7 @@ values={[
{ label: 'SparkSQL', value: 'sparksql', },
]}>
<TabItem value="scala">
+Delete records for the HoodieKeys passed in.<br/>
```scala
// spark-shell
@@ -837,7 +873,9 @@ roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
// fetch should return (total - 2) records
spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
```
-
+:::note
+Only `Append` mode is supported for delete operation.
+:::
</TabItem>
<TabItem value="sparksql">
@@ -852,6 +890,7 @@ delete from h0 where id = 1;
</TabItem>
<TabItem value="python">
+Delete records for the HoodieKeys passed in.<br/>
```python
# pyspark
@@ -889,13 +928,12 @@ roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
# fetch should return (total - 2) records
spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
```
-
-</TabItem>
-</Tabs>
-
:::note
Only `Append` mode is supported for delete operation.
:::
+</TabItem>
+</Tabs>
+
See the [deletion section](/docs/writing_data#deletes) of the writing data page for more details.
@@ -944,7 +982,6 @@ spark.
</TabItem>
<TabItem value="sparksql">
-**NOTICE**
The insert overwrite non-partitioned table sql statement will convert to the ***insert_overwrite_table*** operation.
e.g.
@@ -1003,7 +1040,6 @@ spark.
</TabItem>
<TabItem value="sparksql">
-**NOTICE**
The insert overwrite partitioned table sql statement will convert to the ***insert_overwrite*** operation.
e.g.
@@ -1036,7 +1072,6 @@ alter table h0_1 add columns(ext0 string);
alter table h0_1 change column id id bigint;
```
-## Setting custom hudi configs
### Use set command
You can use the **set** command to set any custom hudi's config, which will work for the
whole spark session scope.
diff --git a/website/versioned_docs/version-0.9.0/quick-start-guide.md b/website/versioned_docs/version-0.9.0/quick-start-guide.md
index 36cd6ea..35e3a28 100644
--- a/website/versioned_docs/version-0.9.0/quick-start-guide.md
+++ b/website/versioned_docs/version-0.9.0/quick-start-guide.md
@@ -14,8 +14,8 @@ After each write operation we will also show how to read the data both snapshot
## Setup
-Hudi works with Spark-2.4.3+ & Spark 3.x versions. You can follow instructions [here](https://spark.apache.org/downloads) for setting up spark.
-From the extracted directory run spark-shell with Hudi as:
+Hudi works with Spark-2.4.3+ & Spark 3.x versions. You can follow instructions [here](https://spark.apache.org/downloads) for setting up spark.
+With the 0.9.0 release, spark-sql DML support has been added and is experimental.
<Tabs
defaultValue="scala"
@@ -25,6 +25,7 @@ values={[
{ label: 'SparkSQL', value: 'sparksql', },
]}>
<TabItem value="scala">
+From the extracted directory run spark-shell with Hudi as:
```scala
// spark-shell for spark 3
@@ -47,6 +48,7 @@ spark-shell \
<TabItem value="sparksql">
Hudi support using spark sql to write and read data with the **HoodieSparkSessionExtension** sql extension.
+From the extracted directory run spark-sql with Hudi as:
```shell
# spark sql for spark 3
spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
@@ -67,6 +69,7 @@ spark-sql \
</TabItem>
<TabItem value="python">
+From the extracted directory run pyspark with Hudi as:
```python
# pyspark
@@ -185,9 +188,9 @@ Spark-sql needs an explicit create table command.
In general, spark-sql supports two kinds of tables, namely managed and external. If one specifies a location using **location** statement, it is an external table, else its considered a managed table. You can read more about external vs managed tables [here](https://sparkbyexamples.com/apache-hive/difference-between-hive-internal-tables-and-external-tables/).
- Table with primary key:
- Users can choose to create a table with primary key if need be. Else table is considered a non-primary keyed table.
- If the user has specified the **primaryKey** column in options, table is considered to be a primary key table.
- If you are using any of the built-in key generators in Hudi, likely its a primary key table.
+  Users can choose to create a table with a primary key if needed. Otherwise, the table is considered a non-primary-keyed table.
+  One needs to set the **primaryKey** column in options to create a primary key table.
+  If you are using any of the built-in key generators in Hudi, it is likely a primary key table.
Let's go over some of the create table commands.
@@ -212,7 +215,7 @@ Here is an example of creating an MOR external table (location needs to be speci
is used to specify the preCombine field for merge.
```sql
--- creae an external mor table
+-- create an external mor table
create table if not exists hudi_table1 (
id int,
name string,
@@ -227,7 +230,7 @@ options (
);
```
-Here is the example of creating a COW table without primary key.
+Here is an example of creating a COW table without primary key.
```sql
-- create a non-primary key table
@@ -333,8 +336,6 @@ To set any custom hudi config(like index type, max parquet size, etc), see the
## Insert data
-Generate some new trips, load them into a DataFrame and write the DataFrame into the Hudi table as below.
-
<Tabs
defaultValue="scala"
values={[
@@ -344,6 +345,8 @@ values={[
]}>
<TabItem value="scala">
+Generate some new trips, load them into a DataFrame and write the DataFrame into the Hudi table as below.
+
```scala
// spark-shell
val inserts = convertToStringList(dataGen.generateInserts(10))
@@ -356,11 +359,21 @@ df.write.format("hudi").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
-```
-
+```
+:::info
+`mode(Overwrite)` overwrites and recreates the table if it already exists.
+You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key
+(`uuid` in [schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)), partition field (`region/country/city`) and combine logic (`ts` in
+[schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)) to ensure trip records are unique within each partition. For more info, refer to
+[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
+and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
+Here we are using the default write operation: `upsert`. If you have a workload without updates, you can also issue
+`insert` or `bulk_insert` operations, which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations).
+:::
</TabItem>
<TabItem value="python">
+Generate some new trips, load them into a DataFrame and write the DataFrame into the Hudi table as below.
```python
# pyspark
@@ -383,6 +396,16 @@ df.write.format("hudi").
mode("overwrite").
save(basePath)
```
+:::info
+`mode("overwrite")` overwrites and recreates the table if it already exists.
+You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key
+(`uuid` in [schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)), partition field (`region/country/city`) and combine logic (`ts` in
+[schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)) to ensure trip records are unique within each partition. For more info, refer to
+[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
+and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
+Here we are using the default write operation: `upsert`. If you have a workload without updates, you can also issue
+`insert` or `bulk_insert` operations, which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations).
+:::
</TabItem>
@@ -406,46 +429,24 @@ insert overwrite table h0 select 1, 'a1', 20;
-- insert overwrite table with static partition
insert overwrite h_p0 partition(dt = '2021-01-02') select 1, 'a1';
-- insert overwrite table with dynamic partition
insert overwrite table h_p1 select 2 as id, 'a2', '2021-01-03' as dt, '19' as hh;
```
**NOTICE**
-
-1. Insert mode
-
-Hudi support three insert modes when inserting data to a table with primary key(we call it pk-table as followed):
-- upsert <br/>
- This it the default insert mode. For upsert mode, insert statement do the upsert operation for the pk-table which will
- update the duplicate record
-- strict <br/>
- For strict mode, insert statement will keep the primary key uniqueness constraint for COW table which do not allow duplicate record.
- If inserting a record which the primary key is already exists to the table, a HoodieDuplicateKeyException will throw out
- for COW table. For MOR table, it has the same behavior with "upsert" mode.
-
-- non-strict <br/>
- For non-strict mode, hudi just do the insert operation for the pk-table.
-
- We can set the insert mode by using the config: **hoodie.sql.insert.mode**
-
-2. Bulk Insert <br/>
- By default, hudi uses the normal insert operation for insert statements. We can set **hoodie.sql.bulk.insert.enable**
- to true to enable the bulk insert for insert statement.
-
+- Insert mode: Hudi supports two insert modes when inserting data into a table with a primary key (referred to as a pk-table below):<br/>
+  Using `strict` mode, the insert statement keeps the primary key uniqueness constraint for COW tables, which do not allow
+  duplicate records. If a record with the same primary key already exists during insert, a HoodieDuplicateKeyException is thrown
+  for COW tables. For MOR tables, updates to existing records are allowed.<br/>
+  Using `non-strict` mode, hudi uses the same code path as the `insert` operation in the spark data source for the pk-table.<br/>
+  One can set the insert mode using the config **hoodie.sql.insert.mode**.
+
+- Bulk Insert: By default, hudi uses the normal insert operation for insert statements. Users can set **hoodie.sql.bulk.insert.enable**
+ to true to enable the bulk insert for insert statement.
+
</TabItem>
</Tabs>
-:::info
-`mode(Overwrite)` overwrites and recreates the table if it already exists.
-You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key
-(`uuid` in [schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)), partition field (`region/country/city`) and combine logic (`ts` in
-[schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)) to ensure trip records are unique within each partition. For more info, refer to
-[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
-and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
-Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
-:::
-
Checkout https://hudi.apache.org/blog/2021/02/13/hudi-key-generators for various key generator options, like Timestamp based,
complex, custom, NonPartitioned Key gen, etc.
@@ -477,7 +478,8 @@ spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_pat
### Time Travel Query
-Hudi support time travel query since 0.9.0. Currently three query time format are supported:
+Hudi supports time travel queries since 0.9.0. Currently, three query time formats are supported, as shown below.
+
```scala
spark.read.
format("hudi").
@@ -497,7 +499,14 @@ spark.read.
```
-
+:::info
+Since 0.9.0, hudi supports a built-in FileIndex, **HoodieFileIndex**, to query hudi tables,
+which supports partition pruning and the metadata table for queries. This helps improve query performance.
+It also supports a non-global query path, which means users can query the table by the base path without
+specifying the "*" in the query path. This feature is enabled by default for the non-global query path.
+For the global query path, hudi uses the old query path.
+Refer to [Table types and queries](/docs/concepts#table-types--queries) for more info on all table types and query types supported.
+:::
</TabItem>
<TabItem value="sparksql">
@@ -522,18 +531,38 @@ spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
```
-</TabItem>
-</Tabs>
+### Time Travel Query
+
+Hudi supports time travel queries since 0.9.0. Currently, three query time formats are supported, as shown below.
+
+```python
+spark.read. \
+  format("hudi"). \
+  option("as.of.instant", "20210728141108"). \
+  load(basePath)
+
+spark.read. \
+  format("hudi"). \
+  option("as.of.instant", "2021-07-28 14:11:08"). \
+  load(basePath)
+
+# It is equal to "as.of.instant = 2021-07-28 00:00:00"
+spark.read. \
+  format("hudi"). \
+  option("as.of.instant", "2021-07-28"). \
+  load(basePath)
+```
:::info
Since 0.9.0 hudi has support a hudi built-in FileIndex: **HoodieFileIndex** to query hudi table,
-which has support partition prune and metatable for query. This will help improve query performance.
+which supports partition pruning and metatable for query. This will help improve query performance.
It also supports non-global query path which means users can query the table by the base path without
-specify the "*" in the query path.
-This feature has enabled by default for the non-global query path. For the global query path, we will
-rollback to the old query way.
+specifying the "*" in the query path. This feature is enabled by default for the non-global query path.
+For the global query path, hudi uses the old query path.
Refer to [Table types and queries](/docs/concepts#table-types--queries) for more info on all table types and query types supported.
:::
+</TabItem>
+</Tabs>
## Update data
@@ -562,7 +591,11 @@ df.write.format("hudi").
mode(Append).
save(basePath)
```
-
+:::note
+Notice that the save mode is now `Append`. In general, always use append mode unless you are trying to create the table for the first time.
+[Querying](#query-data) the data again will now show updated trips. Each write operation generates a new [commit](/docs/concepts)
+denoted by the timestamp. Look for changes in the `_hoodie_commit_time`, `rider`, and `driver` fields for the same `_hoodie_record_key`s compared to the previous commit.
+:::
</TabItem>
<TabItem value="sparksql">
@@ -612,19 +645,18 @@ when not matched then insert (id,name,price) values(id, name, price)
```
**Notice**
-1.The merge-on condition can be only on primary keys. Support to merge based on other fields will be added in future.
-2. Support for partial updates for Merge-On-Read table will be added in future.
+- The merge-on condition can only be on primary keys. Support for merging based on other fields will be added in the future.
+- Partial updates are supported for COW tables.
e.g.
```sql
merge into h0 using s0
on h0.id = s0.id
when matched then update set price = s0.price * 2
```
-This works well for Cow-On-Write table which support update only the **price** field.
-For Merge-ON-READ table this will be supported in the future.
-
-3、Target table's fields cannot be the right-value of the update expression for Merge-On-Read table.
-e.g.
+This works well for Copy-On-Write tables, which support updating only the **price** field.
+For Merge-On-Read tables this will be supported in the future.
+- A target table's fields cannot be the right-hand value of the update expression for Merge-On-Read tables.
+ e.g.
```sql
merge into h0 using s0
on h0.id = s0.id
@@ -632,7 +664,7 @@ e.g.
name = h0.name,
price = s0.price + h0.price
```
-This can work well for Cow-On-Write table, for Merge-ON-READ table this will be supported in the future.
+This works well for Copy-On-Write tables, but is not yet supported for Merge-On-Read tables.
### Update
**Syntax**
@@ -657,15 +689,14 @@ df.write.format("hudi"). \
mode("append"). \
save(basePath)
```
-
-</TabItem>
-</Tabs>
-
:::note
Notice that the save mode is now `Append`. In general, always use append mode unless you are trying to create the table for the first time.
-[Querying](#query-data) the data again will now show updated trips. Each write operation generates a new [commit](/docs/concepts)
-denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, `driver` fields for the same `_hoodie_record_key`s in previous commit.
+[Querying](#query-data) the data again will now show updated trips. Each write operation generates a new [commit](/docs/concepts)
+denoted by the timestamp. Look for changes in the `_hoodie_commit_time`, `rider`, and `driver` fields for the same `_hoodie_record_key`s compared to the previous commit.
:::
+</TabItem>
+</Tabs>
+
## Incremental query
@@ -795,7 +826,7 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hud
</Tabs>
## Delete data {#deletes}
-Delete records for the HoodieKeys passed in.
+
<Tabs
defaultValue="scala"
@@ -805,7 +836,7 @@ values={[
{ label: 'SparkSQL', value: 'sparksql', },
]}>
<TabItem value="scala">
-
+Delete records for the HoodieKeys passed in.<br/>
```scala
// spark-shell
// fetch total records count
@@ -838,6 +869,9 @@ roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
```
+:::note
+Only `Append` mode is supported for delete operation.
+:::
</TabItem>
<TabItem value="sparksql">
@@ -852,6 +886,7 @@ delete from h0 where id = 1;
</TabItem>
<TabItem value="python">
+Delete records for the HoodieKeys passed in.<br/>
```python
# pyspark
@@ -889,13 +924,13 @@ roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
# fetch should return (total - 2) records
spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
```
+:::note
+Only `Append` mode is supported for delete operation.
+:::
</TabItem>
</Tabs>
-:::note
-Only `Append` mode is supported for delete operation.
-:::
See the [deletion section](/docs/writing_data#deletes) of the writing data page for more details.
@@ -944,7 +979,6 @@ spark.
</TabItem>
<TabItem value="sparksql">
-**NOTICE**
The insert overwrite non-partitioned table sql statement will convert to the ***insert_overwrite_table*** operation.
e.g.
@@ -1003,7 +1037,6 @@ spark.
</TabItem>
<TabItem value="sparksql">
-**NOTICE**
The insert overwrite partitioned table sql statement will convert to the ***insert_overwrite*** operation.
e.g.
@@ -1036,7 +1069,6 @@ alter table h0_1 add columns(ext0 string);
alter table h0_1 change column id id bigint;
```
-## Setting custom hudi configs
### Use set command
You can use the **set** command to set any custom hudi's config, which will work for the
whole spark session scope.