Posted to commits@hudi.apache.org by si...@apache.org on 2021/08/30 02:31:29 UTC

[hudi] branch asf-site updated: [HUDI-2373] Fixing 0.9.0 spark quick start page for spark-sql tabs (#3557)

This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 7347a92  [HUDI-2373] Fixing 0.9.0 spark quick start page for spark-sql tabs (#3557)
7347a92 is described below

commit 7347a92ab11a14ff4c938a638d2e6754323eb5a4
Author: Udit Mehrotra <ud...@gmail.com>
AuthorDate: Sun Aug 29 19:31:16 2021 -0700

    [HUDI-2373] Fixing 0.9.0 spark quick start page for spark-sql tabs (#3557)
---
 .../version-0.9.0/quick-start-guide.md             | 122 ++++++++++++---------
 1 file changed, 73 insertions(+), 49 deletions(-)

diff --git a/website/versioned_docs/version-0.9.0/quick-start-guide.md b/website/versioned_docs/version-0.9.0/quick-start-guide.md
index 39cf458..36cd6ea 100644
--- a/website/versioned_docs/version-0.9.0/quick-start-guide.md
+++ b/website/versioned_docs/version-0.9.0/quick-start-guide.md
@@ -359,8 +359,35 @@ df.write.format("hudi").
 ``` 
 
 </TabItem>
+
+<TabItem value="python">
+
+```python
+# pyspark
+inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
+df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+
+hudi_options = {
+    'hoodie.table.name': tableName,
+    'hoodie.datasource.write.recordkey.field': 'uuid',
+    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
+    'hoodie.datasource.write.table.name': tableName,
+    'hoodie.datasource.write.operation': 'upsert',
+    'hoodie.datasource.write.precombine.field': 'ts',
+    'hoodie.upsert.shuffle.parallelism': 2,
+    'hoodie.insert.shuffle.parallelism': 2
+}
+
+df.write.format("hudi"). \
+    options(**hudi_options). \
+    mode("overwrite"). \
+    save(basePath)
+```
+
+</TabItem>
+
 <TabItem value="sparksql">
-    
+
 ```sql
 insert into h0 select 1, 'a1', 20;
 
@@ -388,49 +415,22 @@ insert overwrite h_p0 partition(dt = '2021-01-02') select 1, 'a1';
 1. Insert mode
 
 Hudi supports three insert modes when inserting data into a table with a primary key (we call it a pk-table in the following):
-- upsert
-  
-  This it the default insert mode. For upsert mode, insert statement do the upsert operation for the pk-table which will update the duplicate record
-- strict
-
-For strict mode, insert statement will keep the primary key uniqueness constraint for COW table which do not allow duplicate record.
-If inserting a record which the primary key is already exists to the table, a HoodieDuplicateKeyException will throw out
-for COW table. For MOR table, it has the same behavior with "upsert" mode.
-
-- non-strict
-
-For non-strict mode, hudi just do the insert operation for the pk-table.
+- upsert <br/>
+  This is the default insert mode. In upsert mode, the insert statement performs an upsert on the pk-table, 
+  updating any record with a duplicate key.
+- strict <br/>
+  In strict mode, the insert statement enforces the primary key uniqueness constraint for COW tables, which do not allow duplicate records.
+  Inserting a record whose primary key already exists in the table throws a HoodieDuplicateKeyException for a COW table.
+  For MOR tables, the behavior is the same as in "upsert" mode.
 
-We can set the insert mode by the config: **hoodie.sql.insert.mode**
+- non-strict <br/>
+  In non-strict mode, hudi simply performs the insert operation on the pk-table.
 
-2. Bulk Insert
-By default, hudi use the normal insert operation for insert statement. We can set **hoodie.sql.bulk.insert.enable** to true to enable 
-the bulk insert for insert statement.
-   
-</TabItem>
-<TabItem value="python">
-
-```python
-# pyspark
-inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
-df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
-
-hudi_options = {
-    'hoodie.table.name': tableName,
-    'hoodie.datasource.write.recordkey.field': 'uuid',
-    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
-    'hoodie.datasource.write.table.name': tableName,
-    'hoodie.datasource.write.operation': 'upsert',
-    'hoodie.datasource.write.precombine.field': 'ts',
-    'hoodie.upsert.shuffle.parallelism': 2,
-    'hoodie.insert.shuffle.parallelism': 2
-}
+  We can set the insert mode via the config **hoodie.sql.insert.mode**.
 
-df.write.format("hudi").
-    options(**hudi_options).
-    mode("overwrite").
-    save(basePath)
-```
+2. Bulk Insert <br/>
+   By default, hudi uses the normal insert operation for insert statements. We can set **hoodie.sql.bulk.insert.enable** 
+   to true to enable bulk insert for insert statements, as sketched below.
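+
+   A minimal sketch (assuming the pk-table `h0(id, name, price)` created earlier in this guide) of setting both 
+   configs from the SQL session:
+
+   ```sql
+   -- hypothetical session, reusing the table h0 from this guide
+   set hoodie.sql.insert.mode = strict;        -- duplicate primary keys now raise HoodieDuplicateKeyException (COW)
+   insert into h0 select 2, 'a2', 30;
+
+   set hoodie.sql.bulk.insert.enable = true;   -- subsequent insert statements use the bulk insert path
+   insert into h0 select 3, 'a3', 40;
+   ```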
 
 </TabItem>
 </Tabs>
@@ -566,11 +566,9 @@ df.write.format("hudi").
 </TabItem>
 <TabItem value="sparksql">
 
-Spark sql support two kinds of DML to udpate hudi table: Merge-Into and Update.
-
-###MergeInto
+Spark sql supports two kinds of DML to update a hudi table: Merge-Into and Update.
 
-Hudi support merge-into for both spark 2 & spark 3.
+### MergeInto
 
 **Syntax**
 
@@ -591,7 +589,7 @@ ON <merge_condition>
   INSERT *  |
   INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
 ```
-**Case**
+**Example**
 ```sql
 merge into h0 as target
 using (
@@ -614,8 +612,8 @@ when not matched then insert (id,name,price) values(id, name, price)
 ```
 **Notice**
 
-1、The merge-on condition must be the primary keys currently.
-2、Merge-On-Read table has not support partial update.
+1. The merge-on condition can currently only be on primary keys. Support for merging on other fields will be added in the future.  
+2. Support for partial updates on Merge-On-Read tables will be added in the future.
 e.g.
 ```sql
  merge into h0 using s0
@@ -847,7 +845,7 @@ spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
 ```sql
  DELETE FROM tableIdentifier [ WHERE BOOL_EXPRESSION]
 ```
-**Case**
+**Example**
 ```sql
 delete from h0 where id = 1;
 ```
@@ -907,6 +905,14 @@ Generate some new trips, overwrite the table logically at the Hudi metadata leve
 clean up the previous table snapshot's file groups. This can be faster than deleting the older table and recreating 
 in `Overwrite` mode.
 
+<Tabs
+defaultValue="scala"
+values={[
+{ label: 'Scala', value: 'scala', },
+{ label: 'SparkSQL', value: 'sparksql', },
+]}>
+<TabItem value="scala">
+
 ```scala
 // spark-shell
 spark.
@@ -935,7 +941,9 @@ spark.
   show(10, false)
 
 ``` 
+</TabItem>
 
+<TabItem value="sparksql">
+
 **NOTICE**
 
 The insert overwrite non-partitioned table sql statement will be converted to the ***insert_overwrite_table*** operation.
@@ -943,6 +951,8 @@ e.g.
 ```sql
 insert overwrite table h0 select 1, 'a1', 20;
 ```
+</TabItem>
+</Tabs>
 
 ## Insert Overwrite 
 
@@ -951,6 +961,14 @@ than `upsert` for batch ETL jobs, that are recomputing entire target partitions
 updating the target tables). This is because we are able to bypass indexing, precombining and other repartitioning 
 steps in the upsert write path completely.
 
+<Tabs
+defaultValue="scala"
+values={[
+{ label: 'Scala', value: 'scala', },
+{ label: 'SparkSQL', value: 'sparksql', },
+]}>
+<TabItem value="scala">
+
 ```scala
 // spark-shell
 spark.
@@ -982,6 +1000,9 @@ spark.
   sort("partitionpath","uuid").
   show(100, false)
 ```
+</TabItem>
+
+<TabItem value="sparksql">
+
 **NOTICE**
 
 The insert overwrite partitioned table sql statement will be converted to the ***insert_overwrite*** operation.
@@ -989,6 +1010,9 @@ e.g.
 ```sql
 insert overwrite table h_p1 select 2 as id, 'a2', '2021-01-03' as dt, '19' as hh;
 ```
+</TabItem>
+</Tabs>
+
 ## More Spark Sql Commands
 
 ### AlterTable