Posted to commits@spark.apache.org by we...@apache.org on 2020/05/27 19:01:43 UTC

[spark] branch branch-3.0 updated: [SPARK-31827][SQL] fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form

This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 0383d1e  [SPARK-31827][SQL] fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form
0383d1e is described below

commit 0383d1efe7a7ada8a202fd411bf32b3ed80c9ce4
Author: Wenchen Fan <we...@databricks.com>
AuthorDate: Wed May 27 18:53:19 2020 +0000

    [SPARK-31827][SQL] fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form
    
    If `LLL`/`qqq` is used in the datetime pattern string, and the current JDK in use has a bug for the stand-alone form (see https://bugs.openjdk.java.net/browse/JDK-8114833), throw an exception with a clear error message.
    
    This keeps backward compatibility with Spark 2.4.
    
    Yes, this introduces a user-facing change, as the examples below show.
    
    Spark 2.4
    ```
    scala> sql("select date_format('1990-1-1', 'LLL')").show
    +---------------------------------------------+
    |date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)|
    +---------------------------------------------+
    |                                          Jan|
    +---------------------------------------------+
    ```
    
    Spark 3.0 with Java 11
    ```
    scala> sql("select date_format('1990-1-1', 'LLL')").show
    +---------------------------------------------+
    |date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)|
    +---------------------------------------------+
    |                                          Jan|
    +---------------------------------------------+
    ```
    
    Spark 3.0 with Java 8
    ```
    // before this PR
    scala> sql("select date_format('1990-1-1', 'LLL')").show
    +---------------------------------------------+
    |date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)|
    +---------------------------------------------+
    |                                            1|
    +---------------------------------------------+
    // after this PR
    scala> sql("select date_format('1990-1-1', 'LLL')").show
    org.apache.spark.SparkUpgradeException
    ```
    
    Manually tested with Java 8 and Java 11.
    
    Closes #28646 from cloud-fan/format.
    
    Authored-by: Wenchen Fan <we...@databricks.com>
    Signed-off-by: Wenchen Fan <we...@databricks.com>
---
 docs/sql-ref-datetime-pattern.md                       |  7 ++++---
 .../sql/catalyst/util/DateTimeFormatterHelper.scala    | 18 +++++++++++++++++-
 2 files changed, 21 insertions(+), 4 deletions(-)
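
To check whether a particular JDK is affected, the probe added by this commit can be tried outside Spark. The following is a minimal sketch, not Spark code: the object name `StandAloneFormProbe` and its members are hypothetical, while the pattern `LLL qqq`, `Locale.US`, the date 2000-01-01 and the comparison against `"1 1"` are taken from the diff in this email.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

// Hypothetical helper, not part of Spark: mirrors the static check added in
// DateTimeFormatterHelper so it can be run in a plain Scala REPL.
object StandAloneFormProbe {
  // Formats 2000-01-01 with the stand-alone month (LLL) and quarter (qqq)
  // letters in the US locale.
  def formatted: String =
    DateTimeFormatter.ofPattern("LLL qqq", Locale.US).format(LocalDate.of(2000, 1, 1))

  // Per the commit, an affected JDK prints numbers ("1 1") where text is expected.
  def hasBug: Boolean = formatted == "1 1"
}
```

Calling `StandAloneFormProbe.hasBug` on the JDK that runs Spark shows which branch of the new check applies. As the code comment in the commit notes, only the US locale is probed, so the result says nothing about other locales.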

diff --git a/docs/sql-ref-datetime-pattern.md b/docs/sql-ref-datetime-pattern.md
index 0e00e7b..48e85b4 100644
--- a/docs/sql-ref-datetime-pattern.md
+++ b/docs/sql-ref-datetime-pattern.md
@@ -76,7 +76,8 @@ The count of pattern letters determines the format.
 
 - Year: The count of letters determines the minimum field width below which padding is used. If the count of letters is two, then a reduced two digit form is used. For printing, this outputs the rightmost two digits. For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive. If the count of letters is less than four (but not two), then the sign is only output for negative years. Otherwise, the sign is output if the pad width is [...]
 
-- Month: If the number of pattern letters is 3 or more, the month is interpreted as text; otherwise, it is interpreted as a number. The text form is depend on letters - 'M' denotes the 'standard' form, and 'L' is for 'stand-alone' form. The difference between the 'standard' and 'stand-alone' forms is trickier to describe as there is no difference in English. However, in other languages there is a difference in the word used when the text is used alone, as opposed to in a complete date. F [...]
+- Month: It follows the rule of Number/Text. The text form is depend on letters - 'M' denotes the 'standard' form, and 'L' is for 'stand-alone' form. These two forms are different only in some certain languages. For example, in Russian, 'Июль' is the stand-alone form of July, and 'Июля' is the standard form. Here are examples for all supported pattern letters:
+  - `'M'` or `'L'`: Month number in a year starting from 1. There is no difference between 'M' and 'L'. Month from 1 to 9 are printed without padding.
     ```sql
     spark-sql> select date_format(date '1970-01-01', "M");
     1
@@ -106,8 +107,8 @@ The count of pattern letters determines the format.
     ```
   - `'MMMM'`: full textual month representation in the standard form. It is used for parsing/formatting months as a part of dates/timestamps.
     ```sql
-    spark-sql> select date_format(date '1970-01-01', "MMMM yyyy");
-    January 1970
+    spark-sql> select date_format(date '1970-01-01', "d MMMM");
+    1 January
     spark-sql> select to_csv(named_struct('date', date '1970-01-01'), map('dateFormat', 'd MMMM', 'locale', 'RU'));
     1 января
     ```
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala
index 8289568..353c074 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala
@@ -217,9 +217,19 @@ private object DateTimeFormatterHelper {
     toFormatter(builder, TimestampFormatter.defaultLocale)
   }
 
+  private final val bugInStandAloneForm = {
+    // Java 8 has a bug for stand-alone form. See https://bugs.openjdk.java.net/browse/JDK-8114833
+    // Note: we only check the US locale so that it's a static check. It can produce false-negative
+    // as some locales are not affected by the bug. Since `L`/`q` is rarely used, we choose to not
+    // complicate the check here.
+    // TODO: remove it when we drop Java 8 support.
+    val formatter = DateTimeFormatter.ofPattern("LLL qqq", Locale.US)
+    formatter.format(LocalDate.of(2000, 1, 1)) == "1 1"
+  }
   final val unsupportedLetters = Set('A', 'c', 'e', 'n', 'N', 'p')
   final val unsupportedNarrowTextStyle =
-    Set("GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "QQQQQ", "qqqqq", "uuuuu")
+    Seq("G", "M", "L", "E", "u", "Q", "q").map(_ * 5).toSet
+
   /**
    * In Spark 3.0, we switch to the Proleptic Gregorian calendar and use DateTimeFormatter for
    * parsing/formatting datetime values. The pattern string is incompatible with the one defined
@@ -243,6 +253,12 @@ private object DateTimeFormatterHelper {
           for (style <- unsupportedNarrowTextStyle if patternPart.contains(style)) {
             throw new IllegalArgumentException(s"Too many pattern letters: ${style.head}")
           }
+          if (bugInStandAloneForm && (patternPart.contains("LLL") || patternPart.contains("qqq"))) {
+            throw new IllegalArgumentException("Java 8 has a bug to support stand-alone " +
+              "form (3 or more 'L' or 'q' in the pattern string). Please use 'M' or 'Q' instead, " +
+              "or upgrade your Java version. For more details, please read " +
+              "https://bugs.openjdk.java.net/browse/JDK-8114833")
+          }
           // The meaning of 'u' was day number of week in SimpleDateFormat, it was changed to year
           // in DateTimeFormatter. Substitute 'u' to 'e' and use DateTimeFormatter to parse the
           // string. If parsable, return the result; otherwise, fall back to 'u', and then use the
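
The validation order in the hunk above (narrow text style first, then the stand-alone form check) can be sketched in isolation as follows. This is a simplified illustration under stated assumptions: `PatternCheck`, `validate` and the `jdkHasBug` parameter are hypothetical names, the flag is passed in explicitly instead of being computed once as in the commit, the error text is shortened, and the handling of quoted literals in the real pattern conversion is left out.

```scala
// Hypothetical, simplified stand-in for the checks in DateTimeFormatterHelper.
object PatternCheck {
  // Five repetitions of a letter select the narrow text style, which is rejected.
  // "G" * 5 yields "GGGGG" via Scala's string repetition operator.
  val unsupportedNarrowTextStyle: Set[String] =
    Seq("G", "M", "L", "E", "u", "Q", "q").map(_ * 5).toSet

  // Assumption: the caller supplies the result of the JDK probe as jdkHasBug.
  def validate(patternPart: String, jdkHasBug: Boolean): Unit = {
    for (style <- unsupportedNarrowTextStyle if patternPart.contains(style)) {
      throw new IllegalArgumentException(s"Too many pattern letters: ${style.head}")
    }
    // Three or more 'L' or 'q' request the textual stand-alone form.
    if (jdkHasBug && (patternPart.contains("LLL") || patternPart.contains("qqq"))) {
      throw new IllegalArgumentException(
        "Stand-alone form ('LLL' or 'qqq') is not reliable on this JDK; use 'M' or 'Q' instead")
    }
  }
}
```

With this sketch, `PatternCheck.validate("LLL", jdkHasBug = false)` returns normally, while the same pattern with `jdkHasBug = true` throws. The set literal replaced by the commit listed `"uuuuu"` twice; because it is a `Set`, building it from the seven letters yields the same seven entries.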

