Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/11 11:53:22 UTC

[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #3140: [HUDI-2063] Add Doc For Spark Sql Integrates With Hudi

pengzhiwei2018 commented on a change in pull request #3140:
URL: https://github.com/apache/hudi/pull/3140#discussion_r686733674



##########
File path: website/docs/querying_data.md
##########
@@ -119,6 +119,11 @@ By default, Spark SQL will try to use its own parquet reader instead of Hive Ser
 both parquet and avro data, this default setting needs to be turned off using set `spark.sql.hive.convertMetastoreParquet=false`. 
 This will force Spark to fallback to using the Hive Serde to read the data (planning/executions is still Spark). 
 
+**NOTICE**
+
+Since 0.9.0, Hudi syncs the table to Hive as a Spark datasource table, so the `spark.sql.hive.convertMetastoreParquet=false` setting is no longer needed.

Review comment:
       Well, for an existing old table, hive sync will not add the spark datasource table properties to the table, so users still need this config when querying.
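
       For example, a minimal sketch of querying such an older Hive-synced table through Spark SQL (the table name `old_hudi_tbl` is just a placeholder):

```sql
-- only needed for tables synced to Hive before 0.9.0, where the
-- spark datasource table properties are missing
set spark.sql.hive.convertMetastoreParquet=false;

select * from old_hudi_tbl where dt = '2021-01-02';
```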

##########
File path: website/docs/quick-start-guide.md
##########
@@ -263,6 +451,88 @@ df.write.format("hudi").
   save(basePath)
 ```
 
+</TabItem>
+<TabItem value="sparksql">
+
+Spark SQL supports two kinds of DML to update a Hudi table: Merge-Into and Update.
+
+### MergeInto
+
+Hudi supports merge-into for both Spark 2 and Spark 3.
+
+**Syntax**
+
+```sql
+MERGE INTO tableIdentifier AS target_alias
+USING (sub_query | tableIdentifier) AS source_alias
+ON <merge_condition>
+[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
+[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
+[ WHEN NOT MATCHED [ AND <condition> ]  THEN <not_matched_action> ]
+
+<merge_condition> = A boolean equality condition
+<matched_action>  =
+  DELETE  |
+  UPDATE SET *  |
+  UPDATE SET column1 = expression1 [, column2 = expression2 ...]
+<not_matched_action>  =
+  INSERT *  |
+  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
+```
+**Example**
+```sql
+merge into h0 as target
+using (
+  select id, name, price, flag from s
+) source
+on target.id = source.id
+when matched then update set *
+when not matched then insert *
+;
+
+merge into h0
+using (
+  select id, name, price, flag from s
+) source
+on h0.id = source.id
+when matched and flag != 'delete' then update set id = source.id, name = source.name, price = source.price * 2
+when matched and flag = 'delete' then delete
+when not matched then insert (id,name,price) values(id, name, price)
+;
+```
+**Notice**
+
+1. The merge-on condition must be on the primary keys.
+2. Merge-On-Read tables do not support partial updates (updating only some fields), e.g.
+```sql
+ merge into h0 using s0
+ on h0.id = s0.id
+ when matched then update set price = s0.price * 2
+```
+This works well for a Copy-On-Write table, which supports updating only the **price** field, but it does not work
+for a Merge-On-Read table.

Review comment:
       Make sense.
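
       As an illustrative workaround for Merge-On-Read tables (a sketch only, reusing the `h0`/`s0` tables from the snippet above and assuming that assigning every column behaves like `update set *`):

```sql
-- full-row assignment instead of a partial update, so the statement
-- also works on a Merge-On-Read table
merge into h0 using s0
on h0.id = s0.id
when matched then update set id = s0.id, name = s0.name, price = s0.price * 2;
```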

##########
File path: website/docs/quick-start-guide.md
##########
@@ -22,26 +22,49 @@ defaultValue="scala"
 values={[
 { label: 'Scala', value: 'scala', },
 { label: 'Python', value: 'python', },
+{ label: 'SparkSQL', value: 'sparksql', },
 ]}>
 <TabItem value="scala">
 
 ```scala
 // spark-shell for spark 3
 spark-shell \
-  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
+  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \

Review comment:
       OK

##########
File path: website/docs/querying_data.md
##########
@@ -141,14 +146,13 @@ and executors. Alternatively, hudi-spark-bundle can also fetched via the `--pack
 
 ### Snapshot query {#spark-snap-query}
 This method can be used to retrieve the data table at the present point in time.
-Note: The file path must be suffixed with a number of wildcard asterisk (`/*`) one greater than the number of partition levels. Eg: with table file path "tablePath" partitioned by columns "a", "b", and "c", the load path must be `tablePath + "/*/*/*/*"`
 
 ```scala
 val hudiIncQueryDF = spark
      .read()
-     .format("org.apache.hudi")
+     .format("hudi")
      .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL())
-     .load(tablePath + "/*") //The number of wildcard asterisks here must be one greater than the number of partition
+     .load(tablePath) 

Review comment:
       Yes, since 0.9.0 version, we support HoodieFileIndex, which does not need the "*"
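
       Relatedly, a rough sketch of the same snapshot query from Spark SQL by registering the existing table path (the table name and location here are illustrative):

```sql
-- 0.9.0+: create a table over an existing Hudi path, no wildcards required
create table hudi_snapshot_tbl using hudi location '/path/to/tablePath';

select * from hudi_snapshot_tbl limit 10;
```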

##########
File path: website/docs/quick-start-guide.md
##########
@@ -22,26 +22,49 @@ defaultValue="scala"
 values={[
 { label: 'Scala', value: 'scala', },
 { label: 'Python', value: 'python', },
+{ label: 'SparkSQL', value: 'sparksql', },
 ]}>
 <TabItem value="scala">
 
 ```scala
 // spark-shell for spark 3
 spark-shell \
-  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
+  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
 // spark-shell for spark 2 with scala 2.12
 spark-shell \
-  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:2.4.4 \
+  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:2.4.4 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
 // spark-shell for spark 2 with scala 2.11
 spark-shell \
-  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.8.0,org.apache.spark:spark-avro_2.11:2.4.4 \
+  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.9.0,org.apache.spark:spark-avro_2.11:2.4.4 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
 ```
 
+</TabItem>
+<TabItem value="sparksql">
+
+Hudi supports using Spark SQL to read and write data with the **HoodieSparkSessionExtension** SQL extension.
+```shell
+# spark sql for spark 3
+spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \

Review comment:
       Ok
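
       For reference, a minimal sketch of creating a pk-table once the session is started with the extension enabled (the table name, schema, and options below are illustrative, not taken from this PR):

```sql
-- illustrative pk-table; name and schema are placeholders
create table hudi_demo_tbl (
  id int,
  name string,
  price double
) using hudi
options (
  primaryKey = 'id'
);
```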

##########
File path: website/docs/quick-start-guide.md
##########
@@ -146,6 +249,55 @@ df.write.format("hudi").
   save(basePath)
 ``` 
 
+</TabItem>
+<TabItem value="sparksql">
+    
+```sql
+insert into h0 select 1, 'a1', 20;
+
+-- insert static partition
+insert into h_p0 partition(dt = '2021-01-02') select 1, 'a1';
+
+-- insert dynamic partition
+insert into h_p0 select 1, 'a1', dt;
+
+-- insert dynamic partition
+insert into h_p1 select 1 as id, 'a1', '2021-01-03' as dt, '19' as hh;
+
+-- insert overwrite table
+insert overwrite table h0 select 1, 'a1', 20;
+
+-- insert overwrite table with static partition
+insert overwrite h_p0 partition(dt = '2021-01-02') select 1, 'a1';
+
+-- insert overwrite table with dynamic partition
+insert overwrite table h_p1 select 2 as id, 'a2', '2021-01-03' as dt, '19' as hh;
+```
+
+**NOTICE**
+
+1. Insert mode
+
+Hudi supports three insert modes when inserting data into a table with a primary key (referred to as a pk-table below):
+- upsert
+  
+  This is the default insert mode. In upsert mode, the insert statement performs an upsert on the pk-table, updating any duplicate records.
+- strict
+
+In strict mode, the insert statement keeps the primary key uniqueness constraint for a COW table, which does not allow duplicate records.
+If a record whose primary key already exists in the table is inserted, a HoodieDuplicateKeyException is thrown
+for the COW table. For a MOR table, the behavior is the same as "upsert" mode.
+
+- non-strict
+
+In non-strict mode, Hudi just performs the insert operation for the pk-table.
+
+We can set the insert mode with the config **hoodie.sql.insert.mode**.

Review comment:
       done!
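
       For illustration, a small sketch of switching the insert mode in the SQL session before inserting (reusing the `h0` table from the examples above):

```sql
-- strict mode: re-inserting an existing primary key into a COW pk-table
-- is expected to fail with a HoodieDuplicateKeyException
set hoodie.sql.insert.mode = strict;
insert into h0 select 1, 'a1', 20;

-- non-strict mode: a plain insert, duplicates are not rejected
set hoodie.sql.insert.mode = non-strict;
insert into h0 select 1, 'a1', 20;
```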

##########
File path: website/docs/quick-start-guide.md
##########
@@ -533,12 +816,16 @@ df.write.format("hudi").
 // Should have different keys now, from query before.
 spark.
   read.format("hudi").
-  load(basePath + "/*/*/*/*").
+  load(basePath).
   select("uuid","partitionpath").
   show(10, false)
 
 ``` 
 
+**NOTICE**
+
+The insert overwrite SQL statement on a non-partitioned table is converted to the ***insert_overwrite_table*** operation.

Review comment:
       make sense.
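
       A small illustrative pairing of the two cases (using the `h0` and `h_p1` tables from the earlier snippets; the partitioned case is assumed to map to the partition-level insert_overwrite operation):

```sql
-- non-partitioned table: maps to the insert_overwrite_table operation
insert overwrite table h0 select 1, 'a1', 20;

-- partitioned table: overwrites only the matching partitions
insert overwrite table h_p1 select 2 as id, 'a2', '2021-01-03' as dt, '19' as hh;
```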




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org