Posted to commits@spark.apache.org by we...@apache.org on 2020/05/31 12:43:17 UTC

[spark] branch branch-3.0 updated: [SPARK-31867][SQL] Disable year type datetime patterns which are longer than 10

This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new dc0d12a  [SPARK-31867][SQL] Disable year type datetime patterns which are longer than 10
dc0d12a is described below

commit dc0d12ac3ebef7c11b89f890757ebbbd5adeecfc
Author: Kent Yao <ya...@hotmail.com>
AuthorDate: Sun May 31 12:34:39 2020 +0000

    [SPARK-31867][SQL] Disable year type datetime patterns which are longer than 10
    
    As mentioned in https://github.com/apache/spark/pull/28673 and suggested by cloud-fan at https://github.com/apache/spark/pull/28673#discussion_r432817075.
    
    In this PR, we disable datetime patterns of the form `y..y` and `Y..Y` whose lengths are greater than 10, to avoid the kind of JDK bug described below.
    
    The new datetime formatter introduces a silent data change, for example:
    
    ```sql
    spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
    NULL
    spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
    spark.sql.legacy.timeParserPolicy	legacy
    spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
    00000001970-01-01
    spark-sql>
    ```
    
    For patterns that use `SignStyle.EXCEEDS_PAD`, e.g. `y..y` (len >= 4), the `NumberPrinterParser` formats the value as follows:
    
    ```java
    switch (signStyle) {
      case EXCEEDS_PAD:
        if (minWidth < 19 && value >= EXCEED_POINTS[minWidth]) {
          buf.append(decimalStyle.getPositiveSign());
        }
        break;
    
      // ...
    ```
    Here `minWidth` equals `len(y..y)`, and `EXCEED_POINTS` is defined as:
    
    ```java
    /**
     * Array of 10 to the power of n.
     */
    static final long[] EXCEED_POINTS = new long[] {
        0L,
        10L,
        100L,
        1000L,
        10000L,
        100000L,
        1000000L,
        10000000L,
        100000000L,
        1000000000L,
        10000000000L,
    };
    ```
    
    So when `len(y..y)` is greater than 10, an `ArrayIndexOutOfBoundsException` will be raised.
    
    At the caller side, for `from_unixtime` the exception is suppressed and a silent data change occurs; for `date_format`, the `ArrayIndexOutOfBoundsException` is propagated to the user.
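
    As a minimal illustration (not part of this patch), the JDK behavior can be reproduced directly with `java.time`: with 11 consecutive 'y' letters, `minWidth` becomes 11 and the `EXCEED_POINTS` lookup overflows.

    ```scala
    import java.time.LocalDate
    import java.time.format.DateTimeFormatter

    // Building the formatter succeeds; the unchecked exception is only thrown at format time.
    val formatter = DateTimeFormatter.ofPattern("yyyyyyyyyyy-MM-dd") // 11 'y's
    try {
      formatter.format(LocalDate.of(1970, 1, 1))
    } catch {
      // Raised inside NumberPrinterParser because EXCEED_POINTS has only 11 entries (indices 0-10).
      case e: ArrayIndexOutOfBoundsException => println(s"unchecked exception: $e")
    }
    ```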
    
    This fixes the silent data change described above.
    
    Yes. A `SparkUpgradeException` takes the place of the `null` result when the pattern contains more than 10 consecutive 'y' or 'Y' letters.
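
    As a usage sketch (assuming a running `SparkSession` named `spark`), the failure now surfaces immediately instead of producing a NULL result:

    ```scala
    // With more than 10 consecutive 'y' letters, the pattern check fails fast.
    spark.sql("select from_unixtime(1, 'yyyyyyyyyyy-MM-dd')").show()
    // org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading
    // of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-dd' pattern in the DateTimeFormatter. ...
    ```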
    
    New tests are added.
    
    Closes #28684 from yaooqinn/SPARK-31867-2.
    
    Authored-by: Kent Yao <ya...@hotmail.com>
    Signed-off-by: Wenchen Fan <we...@databricks.com>
    (cherry picked from commit 547c5bf55265772780098ee5e29baa6f095c246b)
    Signed-off-by: Wenchen Fan <we...@databricks.com>
---
 docs/sql-ref-datetime-pattern.md                   |  2 +-
 .../catalyst/util/DateTimeFormatterHelper.scala    | 14 ++++++++---
 .../util/DateTimeFormatterHelperSuite.scala        |  3 +--
 .../spark/sql/util/TimestampFormatterSuite.scala   |  5 ++--
 .../test/resources/sql-tests/inputs/datetime.sql   |  4 ++++
 .../sql-tests/results/ansi/datetime.sql.out        | 28 +++++++++++++++++++++-
 .../sql-tests/results/datetime-legacy.sql.out      | 26 +++++++++++++++++++-
 .../resources/sql-tests/results/datetime.sql.out   | 28 +++++++++++++++++++++-
 8 files changed, 99 insertions(+), 11 deletions(-)

diff --git a/docs/sql-ref-datetime-pattern.md b/docs/sql-ref-datetime-pattern.md
index 48e85b4..865b947 100644
--- a/docs/sql-ref-datetime-pattern.md
+++ b/docs/sql-ref-datetime-pattern.md
@@ -74,7 +74,7 @@ The count of pattern letters determines the format.
   For formatting, the fraction length would be padded to the number of contiguous 'S' with zeros.
   Spark supports datetime of micro-of-second precision, which has up to 6 significant digits, but can parse nano-of-second with exceeded part truncated.
 
-- Year: The count of letters determines the minimum field width below which padding is used. If the count of letters is two, then a reduced two digit form is used. For printing, this outputs the rightmost two digits. For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive. If the count of letters is less than four (but not two), then the sign is only output for negative years. Otherwise, the sign is output if the pad width is [...]
+- Year: The count of letters determines the minimum field width below which padding is used. If the count of letters is two, then a reduced two digit form is used. For printing, this outputs the rightmost two digits. For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive. If the count of letters is less than four (but not two), then the sign is only output for negative years. Otherwise, the sign is output if the pad width is [...]
 
 - Month: It follows the rule of Number/Text. The text form is depend on letters - 'M' denotes the 'standard' form, and 'L' is for 'stand-alone' form. These two forms are different only in some certain languages. For example, in Russian, 'Июль' is the stand-alone form of July, and 'Июля' is the standard form. Here are examples for all supported pattern letters:
   - `'M'` or `'L'`: Month number in a year starting from 1. There is no difference between 'M' and 'L'. Month from 1 to 9 are printed without padding.
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala
index 353c074..5b9d839 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala
@@ -227,8 +227,16 @@ private object DateTimeFormatterHelper {
     formatter.format(LocalDate.of(2000, 1, 1)) == "1 1"
   }
   final val unsupportedLetters = Set('A', 'c', 'e', 'n', 'N', 'p')
-  final val unsupportedNarrowTextStyle =
-    Seq("G", "M", "L", "E", "u", "Q", "q").map(_ * 5).toSet
+  final val unsupportedPatternLengths = {
+    // SPARK-31771: Disable Narrow-form TextStyle to avoid silent data change, as it is Full-form in
+    // 2.4
+    Seq("G", "M", "L", "E", "u", "Q", "q").map(_ * 5) ++
+      // SPARK-31867: Disable year pattern longer than 10 which will cause Java time library throw
+      // unchecked `ArrayIndexOutOfBoundsException` by the `NumberPrinterParser` for formatting. It
+      // makes the call side difficult to handle exceptions and easily leads to silent data change
+      // because of the exceptions being suppressed.
+      Seq("y", "Y").map(_ * 11)
+  }.toSet
 
   /**
    * In Spark 3.0, we switch to the Proleptic Gregorian calendar and use DateTimeFormatter for
@@ -250,7 +258,7 @@ private object DateTimeFormatterHelper {
           for (c <- patternPart if unsupportedLetters.contains(c)) {
             throw new IllegalArgumentException(s"Illegal pattern character: $c")
           }
-          for (style <- unsupportedNarrowTextStyle if patternPart.contains(style)) {
+          for (style <- unsupportedPatternLengths if patternPart.contains(style)) {
             throw new IllegalArgumentException(s"Too many pattern letters: ${style.head}")
           }
           if (bugInStandAloneForm && (patternPart.contains("LLL") || patternPart.contains("qqq"))) {
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelperSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelperSuite.scala
index 34a1ad2..f0cc4d1 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelperSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelperSuite.scala
@@ -40,8 +40,7 @@ class DateTimeFormatterHelperSuite extends SparkFunSuite {
       val e = intercept[IllegalArgumentException](convertIncompatiblePattern(s"yyyy-MM-dd $l G"))
       assert(e.getMessage === s"Illegal pattern character: $l")
     }
-
-    unsupportedNarrowTextStyle.foreach { style =>
+    unsupportedPatternLengths.foreach { style =>
       val e1 = intercept[IllegalArgumentException] {
         convertIncompatiblePattern(s"yyyy-MM-dd $style")
       }
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/util/TimestampFormatterSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/util/TimestampFormatterSuite.scala
index 4324d3c..a72dfb9 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/util/TimestampFormatterSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/util/TimestampFormatterSuite.scala
@@ -403,8 +403,9 @@ class TimestampFormatterSuite extends SparkFunSuite with SQLHelper with Matchers
       intercept[IllegalArgumentException](TimestampFormatter(pattern, UTC).format(0))
     }
     // supported by the legacy one, then we will suggest users with SparkUpgradeException
-    Seq("GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "aa", "aaa").foreach { pattern =>
-      intercept[SparkUpgradeException](TimestampFormatter(pattern, UTC).format(0))
+    Seq("GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "aa", "aaa", "y" * 11, "y" * 11)
+      .foreach { pattern =>
+        intercept[SparkUpgradeException](TimestampFormatter(pattern, UTC).format(0))
     }
   }
 }
diff --git a/sql/core/src/test/resources/sql-tests/inputs/datetime.sql b/sql/core/src/test/resources/sql-tests/inputs/datetime.sql
index 99b7cf9..4eefa0f 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/datetime.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/datetime.sql
@@ -150,3 +150,7 @@ select from_json('{"time":"26/October/2015"}', 'time Timestamp', map('timestampF
 select from_json('{"date":"26/October/2015"}', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy'));
 select from_csv('26/October/2015', 'time Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'));
 select from_csv('26/October/2015', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy'));
+
+select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
+select date_format(timestamp '2018-11-17 13:33:33', 'yyyyyyyyyy-MM-dd HH:mm:ss');
+select date_format(date '2018-11-17', 'yyyyyyyyyyy-MM-dd');
diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out
index ff30bb0..43fe0a6 100644
--- a/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out
@@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 109
+-- Number of queries: 112
 
 
 -- !query
@@ -939,3 +939,29 @@ struct<>
 -- !query output
 org.apache.spark.SparkUpgradeException
 You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd/MMMMM/yyyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+
+-- !query
+select from_unixtime(1, 'yyyyyyyyyyy-MM-dd')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-dd' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+
+-- !query
+select date_format(timestamp '2018-11-17 13:33:33', 'yyyyyyyyyy-MM-dd HH:mm:ss')
+-- !query schema
+struct<date_format(TIMESTAMP '2018-11-17 13:33:33', yyyyyyyyyy-MM-dd HH:mm:ss):string>
+-- !query output
+0000002018-11-17 13:33:33
+
+
+-- !query
+select date_format(date '2018-11-17', 'yyyyyyyyyyy-MM-dd')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-dd' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
diff --git a/sql/core/src/test/resources/sql-tests/results/datetime-legacy.sql.out b/sql/core/src/test/resources/sql-tests/results/datetime-legacy.sql.out
index 30743f8..71b1064 100644
--- a/sql/core/src/test/resources/sql-tests/results/datetime-legacy.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/datetime-legacy.sql.out
@@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 109
+-- Number of queries: 112
 
 
 -- !query
@@ -896,3 +896,27 @@ select from_csv('26/October/2015', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy
 struct<from_csv(26/October/2015):struct<date:date>>
 -- !query output
 {"date":2015-10-26}
+
+
+-- !query
+select from_unixtime(1, 'yyyyyyyyyyy-MM-dd')
+-- !query schema
+struct<from_unixtime(CAST(1 AS BIGINT), yyyyyyyyyyy-MM-dd):string>
+-- !query output
+00000001969-12-31
+
+
+-- !query
+select date_format(timestamp '2018-11-17 13:33:33', 'yyyyyyyyyy-MM-dd HH:mm:ss')
+-- !query schema
+struct<date_format(TIMESTAMP '2018-11-17 13:33:33', yyyyyyyyyy-MM-dd HH:mm:ss):string>
+-- !query output
+0000002018-11-17 13:33:33
+
+
+-- !query
+select date_format(date '2018-11-17', 'yyyyyyyyyyy-MM-dd')
+-- !query schema
+struct<date_format(CAST(DATE '2018-11-17' AS TIMESTAMP), yyyyyyyyyyy-MM-dd):string>
+-- !query output
+00000002018-11-17
diff --git a/sql/core/src/test/resources/sql-tests/results/datetime.sql.out b/sql/core/src/test/resources/sql-tests/results/datetime.sql.out
index dc466d1..9b1c847 100755
--- a/sql/core/src/test/resources/sql-tests/results/datetime.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/datetime.sql.out
@@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 109
+-- Number of queries: 112
 
 
 -- !query
@@ -911,3 +911,29 @@ struct<>
 -- !query output
 org.apache.spark.SparkUpgradeException
 You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd/MMMMM/yyyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+
+-- !query
+select from_unixtime(1, 'yyyyyyyyyyy-MM-dd')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-dd' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+
+-- !query
+select date_format(timestamp '2018-11-17 13:33:33', 'yyyyyyyyyy-MM-dd HH:mm:ss')
+-- !query schema
+struct<date_format(TIMESTAMP '2018-11-17 13:33:33', yyyyyyyyyy-MM-dd HH:mm:ss):string>
+-- !query output
+0000002018-11-17 13:33:33
+
+
+-- !query
+select date_format(date '2018-11-17', 'yyyyyyyyyyy-MM-dd')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-dd' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org