You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by xu...@apache.org on 2022/03/04 04:03:16 UTC

[hudi] branch master updated: [HUDI-3161][RFC-47] Add Call Produce Command for Spark SQL (#4607)

This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
     new 6faed3d  [HUDI-3161][RFC-47] Add Call Produce Command for Spark SQL (#4607)
6faed3d is described below

commit 6faed3d90ab99844017465e63c6a0604bdc3a68c
Author: ForwardXu <fo...@gmail.com>
AuthorDate: Fri Mar 4 12:02:46 2022 +0800

    [HUDI-3161][RFC-47] Add Call Produce Command for Spark SQL (#4607)
---
 rfc/README.md          |   1 +
 rfc/rfc-46/rfc-46.md   |   2 +-
 rfc/rfc-47/process.png | Bin 0 -> 54876 bytes
 rfc/rfc-47/rfc-47.md   | 111 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 113 insertions(+), 1 deletion(-)

diff --git a/rfc/README.md b/rfc/README.md
index a9587d1..0b3142e 100644
--- a/rfc/README.md
+++ b/rfc/README.md
@@ -70,3 +70,4 @@ The list of all RFCs can be found here.
 | 44 | [Hudi Connector for Presto](./rfc-44/rfc-44.md) | `UNDER REVIEW` |
 | 45 | [Asynchronous Metadata Indexing](./rfc-45/rfc-45.md) | `UNDER REVIEW` |
 | 46 | [Optimizing Record Payload Handling](./rfc-46/rfc-46.md) | `UNDER REVIEW` |
+| 47 | [Add Call Produce Command for Spark SQL](./rfc-47/rfc-47.md) | `UNDER REVIEW` |
diff --git a/rfc/rfc-46/rfc-46.md b/rfc/rfc-46/rfc-46.md
index 8b0feff..92aed55 100644
--- a/rfc/rfc-46/rfc-46.md
+++ b/rfc/rfc-46/rfc-46.md
@@ -156,4 +156,4 @@ This refactoring will not be modifying any existing Hudi semantics other than th
 logical correctness of the many flows that will be affected by the refactoring we will rely on the existing set of test-suites.
 
 Nevertheless, we will run corresponding set of benchmarks stressing the flows being affected by the refactoring to validate
-that there are considerable performance advantage of abandoning conversion into intermediate representation completely.
\ No newline at end of file
+that there are considerable performance advantage of abandoning conversion into intermediate representation completely.
diff --git a/rfc/rfc-47/process.png b/rfc/rfc-47/process.png
new file mode 100644
index 0000000..0553fea
Binary files /dev/null and b/rfc/rfc-47/process.png differ
diff --git a/rfc/rfc-47/rfc-47.md b/rfc/rfc-47/rfc-47.md
new file mode 100644
index 0000000..f9017ec
--- /dev/null
+++ b/rfc/rfc-47/rfc-47.md
@@ -0,0 +1,111 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+# RFC-[47]: Add RFC for Add Call Produce Command for Spark SQL
+
+
+## Proposers
+- @forwardxu
+
+## Approvers
+ - @vinoth @Danny @Raymond 
+
+## Status
+
+JIRA: [https://issues.apache.org/jira/browse/HUDI-3161](https://issues.apache.org/jira/browse/HUDI-3161)
+
+## Abstract
+
+We intend to implement Call Procedure Command for Spark SQL.
+
+## Background
+The Call Procedure Command can solve the problems that DDL and DML functions cannot handle. What I can think of are the following 4 aspects:
+- Commit management
+- Metadata table management
+- Table migration
+- Optimization table(Such as Table services like clustering)
+
+## Implementation
+
+### Extended SQL for CALL Command Syntax
+#### Named arguments
+All procedure arguments are named. When passing arguments by name, arguments can be in any order and any optional argument can be omitted.
+```
+CALL system.procedure_name(arg_name_2 => arg_2, arg_name_1 => arg_1, ... arg_name_n => arg_n)
+```
+#### Positional arguments
+When passing arguments by position, the arguments may be omitted if they are optional.
+```
+CALL system.procedure_name(arg_1, arg_2, ... arg_n)
+```
+*note:* The system here has no practical meaning, the complete procedure name is system.procedure_name.
+
+![](process.png)
+
+As shown in the figure above, The execution process of Call Command consists of two parts, SQL Parser and Procedure Run.
+
+### parse
+In the sql parse stage, we will inject a HoodieSqlParser to spark sql which will parse our extended `CALL` syntax to LogicalPlan. If the HoodieSqlParser failed to parse the sql statement, Spark will route it to Spark SQL parser. So we just need to implement our extended syntax in `HoodieSqlParser`.
+
+### resolution
+In the resolution stage, some Hudi resolution rules will be injected to Spark SQL to resolve our extended LogicalPlan to the resolve plan which is a command plan for `CALL`.
+
+### procedure#call
+The `procedure#call` method will translate the logical plan to Hudi's API call. For example, the `ShowCommitsProcedure` will be translated to Hudi API for showing Hudi timeline's commits.
+
+### examples
+Example 1:
+
+```
+call show_commits_metadata(table => 'test_hudi_table');
+```
+
+|    commit_time   | action |	 partition  |               	file_id              |	previous_commit |	num_writes | num_inserts | num_deletes | num_update_writes | total_errors | total_log_blocks | total_corrupt_logblocks | total_rollback_blocks | total_log_records | total_updated_records_compacted | total_bytes_written|
+|----------------- |  ----  |  ------------ |----------------------------------------| ---------------- |------------- |-------------|-------------|------------------ |------------- |----------------  |-------------------------|---------------------- |-------------------|-------------------------------  |------------------- |
+|20220109225319449 | commit | dt=2021-05-03	| d0073a12-085d-4f49-83e9-402947e7e90a-0 | null             | 	1          | 	1        | 	0          | 0                 | 0            | 0				 | 0					   | 0                     | 0                 | 0                               | 435349             |
+|20220109225311742 | commit	| dt=2021-05-02	| b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0 | 20220109214830592| 	1          | 	1        | 	0	       | 0	               | 0            | 0				 | 0					   | 0                     | 0                 | 0                               | 435340             |
+|20220109225301429 | commit	| dt=2021-05-01	| 0d7298b3-6b55-4cff-8d7d-b0772358b78a-0 | 20220109214830592| 	1          | 	1        | 	0          | 0                 | 0	          | 0				 | 0					   | 0                     | 0                 | 0                               | 435340             |
+|20220109214830592 | commit	| dt=2021-05-01	| 0d7298b3-6b55-4cff-8d7d-b0772358b78a-0 | 20220109191631015| 	0          | 	0        | 	1	       | 0	               | 0            | 0				 | 0					   | 0                     | 0                 | 0                               | 432653             |
+|20220109214830592 | commit	| dt=2021-05-02	| b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0 | 20220109191648181| 	0          | 	0        | 	1	       | 0	               | 0	          | 0				 | 0					   | 0                     | 0                 | 0                               | 432653             |
+|20220109191648181 | commit	| dt=2021-05-02	| b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0 | null             | 	1          | 	1        | 	0          | 0                 | 0            | 0				 | 0					   | 0                     | 0                 | 0                               | 435341             |
+|20220109191631015 | commit	| dt=2021-05-01	| 0d7298b3-6b55-4cff-8d7d-b0772358b78a-0 | null             | 	1          | 	1        | 	0          | 0                 | 0            | 0				 | 0					   | 0                     | 0                 | 0                               | 435341             |
+
+Time taken: 0.844 seconds, Fetched 7 row(s)
+
+Example 2:
+
+```
+call rollback_to_instant(table => 'test_hudi_table', instant_time => '20220109225319449');
+```
+
+| rollback_result |
+| :---------------|
+|    true         |
+
+Time taken: 5.038 seconds, Fetched 1 row(s)
+
+### Spark SQL permissions and security
+For security, please refer to the description of [Spark Security](https://spark.apache.org/docs/latest/security.html).
+
+## Rollout/Adoption Plan
+This is a new feature can use Spark SQL works And does not depend on the Spark version.
+
+## Test Plan
+
+- [☑️️] Unit tests for this feature
+- [✖️] Product integration tests
+- [✖️] Benchmark snapshot query for large tables