Posted to commits@spark.apache.org by sr...@apache.org on 2021/07/21 02:39:14 UTC
[spark] branch master updated: [SPARK-36153][SQL][DOCS] Update transform doc to match the current code
This is an automated email from the ASF dual-hosted git repository.
srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 305d563 [SPARK-36153][SQL][DOCS] Update transform doc to match the current code
305d563 is described below
commit 305d563329bfb7a4ef582655b88a72d826a4e8aa
Author: Angerszhuuuu <an...@gmail.com>
AuthorDate: Tue Jul 20 21:38:37 2021 -0500
[SPARK-36153][SQL][DOCS] Update transform doc to match the current code
### What changes were proposed in this pull request?
Update the transform doc to match the latest code.
![image](https://user-images.githubusercontent.com/46485123/126175747-672cccbc-4e42-440f-8f1e-f00b6dc1be5f.png)
### Why are the changes needed?
Keep the documentation consistent with the current code.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Not needed; this is a documentation-only change.
Closes #33362 from AngersZhuuuu/SPARK-36153.
Lead-authored-by: Angerszhuuuu <an...@gmail.com>
Co-authored-by: AngersZhuuuu <an...@gmail.com>
Signed-off-by: Sean Owen <sr...@gmail.com>
---
docs/sql-ref-syntax-qry-select-transform.md | 101 ++++++++++++++++++++++++----
1 file changed, 88 insertions(+), 13 deletions(-)
diff --git a/docs/sql-ref-syntax-qry-select-transform.md b/docs/sql-ref-syntax-qry-select-transform.md
index 21966f2..5a38e14 100644
--- a/docs/sql-ref-syntax-qry-select-transform.md
+++ b/docs/sql-ref-syntax-qry-select-transform.md
@@ -24,6 +24,15 @@ license: |
The `TRANSFORM` clause is used to specify a Hive-style transform query specification
to transform the inputs by running a user-specified command or script.
+Spark's script transform supports two modes:
+
+ 1. Hive support disabled: Spark script transform can run with `spark.sql.catalogImplementation=in-memory`
+ or without `SparkSession.builder.enableHiveSupport()`. In this case, Spark uses the script transform only with
+ `ROW FORMAT DELIMITED` and treats all values passed to the script as strings.
+ 2. Hive support enabled: When Spark is run with `spark.sql.catalogImplementation=hive` or Spark SQL is started
+ with `SparkSession.builder.enableHiveSupport()`, Spark can use the script transform with both Hive SerDe and
+ `ROW FORMAT DELIMITED`.
+
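+In either mode, a minimal transform query might look like the following sketch
+(the table `t` and its columns `a` and `b` are hypothetical here, and `cat` is
+just a trivial script that echoes its input):
+
+```sql
+-- Pipe each input row through the trivial script `cat`.
+-- Without an AS clause, the output schema defaults to
+-- key: STRING, value: STRING (see the behavior sections below).
+SELECT TRANSFORM(a, b)
+    USING 'cat'
+FROM t;
+```
+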
### Syntax
```sql
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
Specifies a command or a path to a script that processes the data.
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than the number of specified output columns,
-the output columns will only select the corresponding columns and the remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be `key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the `value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`.
+### ROW FORMAT DELIMITED behavior
+
+When Spark uses the `ROW FORMAT DELIMITED` format (a complete query sketch follows this list):
+ - Spark uses the character `\u0001` as the default field delimiter; this delimiter can be overridden by `FIELDS TERMINATED BY`.
+ - Spark uses the character `\n` as the default line delimiter; this delimiter can be overridden by `LINES TERMINATED BY`.
+ - Spark uses the string `\N` as the default `NULL` value in order to differentiate `NULL` values
+ from the literal string `NULL`. This marker can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combines them by tabs before feeding them to the user script.
+ For complex types such as `ARRAY`/`MAP`/`STRUCT`, Spark uses `to_json` to cast them to input `JSON` strings and uses
+ `from_json` to convert the resulting output `JSON` strings back to `ARRAY`/`MAP`/`STRUCT` data.
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are delimiters for splitting complex data such as
+ `ARRAY`/`MAP`/`STRUCT`, but since Spark uses `to_json` and `from_json` to handle complex data types in `JSON` format,
+ `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` do not take effect in the default row format.
+ - The standard output of the user script is treated as tab-separated `STRING` columns. Any cell containing only the string `\N`
+ is re-interpreted as a `NULL` value, and the resulting `STRING` column is then cast to the data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified output columns,
+ the missing output columns are filled with `NULL`. For example:
+ ```
+ output tabs: 1, 2
+ output columns: A: INT, B: INT, C: INT
+ result:
+ +---+---+------+
+ |  a|  b|     c|
+ +---+---+------+
+ |  1|  2|  NULL|
+ +---+---+------+
+ ```
+ - If the actual number of output columns is more than the number of specified output columns,
+ Spark keeps only the corresponding leading columns and discards the remaining part.
+ For example, if the output has three tab-separated fields and there are only two output columns:
+ ```
+ output tabs: 1, 2, 3
+ output columns: A: INT, B: INT
+ result:
+ +---+---+
+ |  a|  b|
+ +---+---+
+ |  1|  2|
+ +---+---+
+ ```
+ - If there is no `AS` clause after `USING my_script`, the output schema is `key: STRING, value: STRING`.
+ The `key` column contains all the characters before the first tab and the `value` column contains the remaining characters after the first tab.
+ If there is no tab at all, Spark returns `NULL` for the `value` column. For example (`\t` denotes a tab character below):
+ ```
+ output tabs: 1, 2, 3
+ output columns:
+ result:
+ +-----+-------+
+ |  key|  value|
+ +-----+-------+
+ |    1|   2\t3|
+ +-----+-------+
+
+ output tabs: 1
+ output columns:
+ result:
+ +-----+-------+
+ |  key|  value|
+ +-----+-------+
+ |    1|   NULL|
+ +-----+-------+
+ ```
+
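+As a concrete sketch of the delimiters above being overridden on both the input
+and the output side (the table `person` with columns `zip_code`, `name`, and
+`age` is hypothetical, and `cat` simply echoes its input):
+
+```sql
+-- Fields fed to the script and read back from it are comma-separated,
+-- and NULL values are written and read as the literal string 'NULL'.
+SELECT TRANSFORM(zip_code, name, age)
+    ROW FORMAT DELIMITED
+    FIELDS TERMINATED BY ','
+    LINES TERMINATED BY '\n'
+    NULL DEFINED AS 'NULL'
+    USING 'cat' AS (a STRING, b STRING, c STRING)
+    ROW FORMAT DELIMITED
+    FIELDS TERMINATED BY ','
+    LINES TERMINATED BY '\n'
+    NULL DEFINED AS 'NULL'
+FROM person;
+```
+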
+### Hive SerDe behavior
+
+When Hive support is enabled and Hive SerDe mode is used (see the sketch after this list):
+ - Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` by default, so columns are cast
+ to `STRING` and combined by tabs before being fed to the user script.
+ - All `NULL` values are converted to the string `\N` in order to differentiate `NULL` values from the literal string `NULL`.
+ - The standard output of the user script is treated as tab-separated `STRING` columns. Any cell containing only the string `\N` is re-interpreted
+ as a `NULL` value, and the resulting `STRING` column is then cast to the data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified output columns,
+ the missing output columns are filled with `NULL`.
+ - If the actual number of output columns is more than the number of specified output columns,
+ Spark keeps only the corresponding leading columns and discards the remaining part.
+ - If there is no `AS` clause after `USING my_script`, the output schema is `key: STRING, value: STRING`.
+ The `key` column contains all the characters before the first tab and the `value` column contains the remaining characters after the first tab.
+ If there is no tab at all, Spark returns `NULL` for the `value` column.
+ - These defaults can be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`.
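+A matching sketch with an explicit Hive SerDe specification (this requires Hive
+support to be enabled; the table `person` with columns `zip_code`, `name`, and
+`age` is hypothetical, and `cat` simply echoes its input):
+
+```sql
+-- Explicitly name LazySimpleSerDe on both sides and override its
+-- field delimiter through SERDEPROPERTIES.
+SELECT TRANSFORM(zip_code, name, age)
+    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
+    WITH SERDEPROPERTIES ('field.delim' = '\t')
+    USING 'cat' AS (a STRING, b STRING, c STRING)
+    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
+    WITH SERDEPROPERTIES ('field.delim' = '\t')
+FROM person;
+```
+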
### Examples