Posted to commits@spark.apache.org by we...@apache.org on 2021/07/13 12:30:41 UTC
[spark] branch branch-3.1 updated: [SPARK-36081][SPARK-36066][SQL]
Update the document about the behavior change of trimming characters for
cast
This is an automated email from the ASF dual-hosted git repository.
wenchen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.1 by this push:
new e6e2ce1 [SPARK-36081][SPARK-36066][SQL] Update the document about the behavior change of trimming characters for cast
e6e2ce1 is described below
commit e6e2ce174e5c6723207a8bd73f5b474ecb7a790f
Author: Kousuke Saruta <sa...@oss.nttdata.com>
AuthorDate: Tue Jul 13 20:28:47 2021 +0800
[SPARK-36081][SPARK-36066][SQL] Update the document about the behavior change of trimming characters for cast
### What changes were proposed in this pull request?
This PR modifies the comment for `UTF8String.trimAll` and `sql-migration-guide.md`.
The comment for `UTF8String.trimAll` reads as follows.
```
Trims whitespaces ({literal <=} ASCII 32) from both ends of this string.
```
Similarly, `sql-migration-guide.md` describes the behavior of `cast` as follows.
```
In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint),
datetime types(date, timestamp and interval) and boolean type,
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values,
for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`,
`cast('2019-10-10\t' as date)` results the date value `2019-10-10`.
In Spark version 2.4 and below, when casting string to integrals and booleans,
it does not trim the whitespaces from both ends; the foregoing results is `null`,
while to datetimes, only the trailing spaces (= ASCII 32) are removed.
```
But SPARK-32559 (#29375) changed this behavior: since Spark 3.0.1, only whitespace ASCII characters are trimmed.
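The distinction can be sketched with a small standalone Java snippet (an illustration of the post-SPARK-32559 semantics, not the actual `UTF8String` implementation, which operates on UTF-8 bytes rather than `String`):

```java
public class TrimAllSketch {
    // Mimics the post-SPARK-32559 semantics: trim only characters that
    // Character.isWhitespace considers whitespace, not every char <= 0x20.
    static String trimAll(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && Character.isWhitespace(s.charAt(start))) start++;
        while (end > start && Character.isWhitespace(s.charAt(end - 1))) end--;
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // Tab (ASCII 9) is whitespace, so it is still trimmed.
        System.out.println(trimAll("\t1\t"));                 // prints "1"
        // Backspace (ASCII 8) is <= 32 but NOT whitespace, so it is kept;
        // this is why cast('\b1\b' as int) yields NULL since Spark 3.0.1.
        System.out.println(trimAll("\b1\b").equals("\b1\b")); // prints "true"
    }
}
```

Before SPARK-32559, the loop condition was effectively `getByte(i) <= 0x20`, which also stripped control characters such as backspace.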
### Why are the changes needed?
To keep the documentation consistent with the behavior change made in SPARK-32559.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Confirmed that the documentation builds with the following command.
```
SKIP_API=1 bundle exec jekyll build
```
Closes #33287 from sarutak/fix-utf8string-trim-issue.
Authored-by: Kousuke Saruta <sa...@oss.nttdata.com>
Signed-off-by: Wenchen Fan <we...@databricks.com>
(cherry picked from commit 57a4f310df30257c5d7e545c962b59767807d0c7)
Signed-off-by: Wenchen Fan <we...@databricks.com>
---
.../main/java/org/apache/spark/unsafe/types/UTF8String.java | 10 +++++-----
docs/sql-migration-guide.md | 2 ++
2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
index c6aa5f0..a226699 100644
--- a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
+++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
@@ -563,10 +563,10 @@ public final class UTF8String implements Comparable<UTF8String>, Externalizable,
}
/**
- * Trims whitespaces ({@literal <=} ASCII 32) from both ends of this string.
+ * Trims whitespace ASCII characters from both ends of this string.
*
- * Note that, this method is the same as java's {@link String#trim}, and different from
- * {@link UTF8String#trim()} which remove only spaces(= ASCII 32) from both ends.
+ * Note that, this method is different from {@link UTF8String#trim()} which removes
+ * only spaces(= ASCII 32) from both ends.
*
* @return A UTF8String whose value is this UTF8String, with any leading and trailing white
* space removed, or this UTF8String if it has no leading or trailing whitespace.
@@ -574,13 +574,13 @@ public final class UTF8String implements Comparable<UTF8String>, Externalizable,
*/
public UTF8String trimAll() {
int s = 0;
- // skip all of the whitespaces (<=0x20) in the left side
+ // skip all of the whitespaces in the left side
while (s < this.numBytes && Character.isWhitespace(getByte(s))) s++;
if (s == this.numBytes) {
// Everything trimmed
return EMPTY_UTF8;
}
- // skip all of the whitespaces (<=0x20) in the right side
+ // skip all of the whitespaces in the right side
int e = this.numBytes - 1;
while (e > s && Character.isWhitespace(getByte(e))) e--;
if (s == 0 && e == numBytes - 1) {
diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 195a48f..c7f233d 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -78,6 +78,8 @@ license: |
- In Spark 3.0, JSON datasource and JSON function `schema_of_json` infer TimestampType from string values if they match to the pattern defined by the JSON option `timestampFormat`. Since version 3.0.1, the timestamp type inference is disabled by default. Set the JSON option `inferTimestamp` to `true` to enable such type inference.
+- In Spark 3.0, when casting string to integral types(tinyint, smallint, int and bigint), datetime types(date, timestamp and interval) and boolean type, the leading and trailing characters (<= ASCII 32) will be trimmed. For example, `cast('\b1\b' as int)` results `1`. Since Spark 3.0.1, only the leading and trailing whitespace ASCII characters will be trimmed. For example, `cast('\t1\t' as int)` results `1` but `cast('\b1\b' as int)` results `NULL`.
+
## Upgrading from Spark SQL 2.4 to 3.0
### Dataset/DataFrame APIs