You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2018/02/15 17:13:35 UTC

[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/20621

    [SPARK-23436][SQL] Infer partition as Date only if it can be casted to Date

    ## What changes were proposed in this pull request?
    
    Before the patch, Spark could infer as Date a partition value which cannot be casted to Date (this can happen when there are extra characters after a valid date, like `2018-02-15AAA`).
    
    When this happens and the input format has metadata which define the schema of the table, then `null` is returned as a value for the partition column, because the `cast` operator used in (`PartitioningAwareFileIndex.inferPartitioning`) is unable to convert the value.
    
    The PR checks in the partition inference that values can be casted to Date and Timestamp, in order to infer that datatype to them.
    
    ## How was this patch tested?
    
    added UT


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-23436

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20621.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20621
    
----
commit 2f05ab8e82b0940e84cbe407abe49f72cddeef11
Author: Marco Gaido <ma...@...>
Date:   2018-02-15T16:59:20Z

    [SPARK-23436][SQL] Infer partition as Date only if it can be casted to Date

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    @gatorsmile thanks for checking. Yes, Spark 2.2 is affected too, so I am not sure whether this should be considered a blocker regression. But, I think we should fix it as soon as possible, nonetheless.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    **[Test build #87526 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87526/testReport)** for PR 20621 at commit [`8698f4d`](https://github.com/apache/spark/commit/8698f4de7b6e1b9453a12bc949ce8666d6322a87).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20621#discussion_r168935309
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -407,6 +407,34 @@ object PartitioningUtils {
           Literal(bigDecimal)
         }
     
    +    val dateTry = Try {
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // DateType
    +      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
    +      // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
    +      // This can happen since DateFormat.parse  may not use the entire text of the given string:
    +      // so if there are extra-characters after the date, it returns correctly.
    +      // We need to check that we can cast the raw string since we later can use Cast to get
    +      // the partition values with the right DataType (see
    +      // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
    +      val dateValue = Cast(Literal(raw), DateType).eval()
    +      // Disallow DateType if the cast returned null
    +      require(dateValue != null)
    +      Literal.create(dateValue, DateType)
    +    }
    +
    +    val timestampTry = Try {
    +      val unescapedRaw = unescapePathName(raw)
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // TimestampType
    +      DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
    --- End diff --
    
    `inferPartitioning` will use `PartitioningUtils.parsePartitions` to infer the partition spec if there is no `userPartitionSchema`. It is used by `DataSource.sourceSchema`. Seems this change makes the partition directory previously parsing-able now unable to parse. Will it change behavior of other code path?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20621#discussion_r168919128
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -407,6 +407,34 @@ object PartitioningUtils {
           Literal(bigDecimal)
         }
     
    +    val dateTry = Try {
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // DateType
    +      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
    +      // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
    +      // This can happen since DateFormat.parse  may not use the entire text of the given string:
    +      // so if there are extra-characters after the date, it returns correctly.
    +      // We need to check that we can cast the raw string since we later can use Cast to get
    +      // the partition values with the right DataType (see
    +      // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
    +      val dateValue = Cast(Literal(raw), DateType).eval()
    +      // Disallow DateType if the cast returned null
    +      require(dateValue != null)
    +      Literal.create(dateValue, DateType)
    +    }
    +
    +    val timestampTry = Try {
    +      val unescapedRaw = unescapePathName(raw)
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // TimestampType
    +      DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
    --- End diff --
    
    Can we save the parsing here? I think to cast string to `TimestampType` will also check if it can be parsed as timestamp?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87485/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/921/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20621#discussion_r168910831
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala ---
    @@ -1120,4 +1120,16 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha
               Row(3, BigDecimal("2" * 30)) :: Nil)
         }
       }
    +
    +  test("SPARK-23436: invalid Dates should be inferred as String in partition inference") {
    +    withTempPath { path =>
    +      val data = Seq(("1", "2018-01", "2018-01-01-04", "test"))
    +        .toDF("id", "date_month", "date_hour", "data")
    +
    +      data.write.partitionBy("date_month", "date_hour").parquet(path.getAbsolutePath)
    +      val input = spark.read.parquet(path.getAbsolutePath)
    +      checkAnswer(input.select("id", "date_month", "date_hour", "data"),
    --- End diff --
    
    shall we also check the schema to make sure that field is string type?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/923/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    it's not a very serious bug, I'd like to hold it until 2.3 is released. We may have it in 2.3.1


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    thanks @cloud-fan. Sorry, since this seems a bug to me, why this wasn't backported to branch-2.3 too? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/922/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    **[Test build #87488 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87488/testReport)** for PR 20621 at commit [`6b56408`](https://github.com/apache/spark/commit/6b5640833a2d45986a0cf6074d7211a8ba9d2b3e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20621#discussion_r168700662
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -407,6 +407,29 @@ object PartitioningUtils {
           Literal(bigDecimal)
         }
     
    +    val dateTry = Try {
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // DateType
    +      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
    +      // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
    +      // We need to check that we can cast the raw string since we later can use Cast to get
    +      // the partition values with the right DataType (see
    +      // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
    +      val dateOption = Option(Cast(Literal(raw), DateType).eval())
    --- End diff --
    
    I mean .. simply like:
    
    ```
    // Disallow date type if the cast returned null blah blah
    require(dateOption.isDefine)
    ```
    
    nothing special. I am fine with not adding it too.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87508/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20621#discussion_r168925882
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -407,6 +407,34 @@ object PartitioningUtils {
           Literal(bigDecimal)
         }
     
    +    val dateTry = Try {
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // DateType
    +      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
    +      // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
    +      // This can happen since DateFormat.parse  may not use the entire text of the given string:
    +      // so if there are extra-characters after the date, it returns correctly.
    +      // We need to check that we can cast the raw string since we later can use Cast to get
    +      // the partition values with the right DataType (see
    +      // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
    +      val dateValue = Cast(Literal(raw), DateType).eval()
    +      // Disallow DateType if the cast returned null
    +      require(dateValue != null)
    +      Literal.create(dateValue, DateType)
    +    }
    +
    +    val timestampTry = Try {
    +      val unescapedRaw = unescapePathName(raw)
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // TimestampType
    +      DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
    --- End diff --
    
    sorry, I am not sure I got 100% your question, may you elaborate it a bit more please?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    **[Test build #87485 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87485/testReport)** for PR 20621 at commit [`2f05ab8`](https://github.com/apache/spark/commit/2f05ab8e82b0940e84cbe407abe49f72cddeef11).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    cc @cloud-fan @HyukjinKwon @viirya 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    It sounds like Spark 2.2 already has this bug. This causes an incorrect result.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20621#discussion_r168680871
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -407,6 +407,29 @@ object PartitioningUtils {
           Literal(bigDecimal)
         }
     
    +    val dateTry = Try {
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // DateType
    +      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
    +      // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
    +      // We need to check that we can cast the raw string since we later can use Cast to get
    +      // the partition values with the right DataType (see
    +      // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
    +      val dateOption = Option(Cast(Literal(raw), DateType).eval())
    --- End diff --
    
    Can we add `require(dateOption.isDefine)` with some comments explicitly?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    This is a blocker-level regression.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    **[Test build #87488 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87488/testReport)** for PR 20621 at commit [`6b56408`](https://github.com/apache/spark/commit/6b5640833a2d45986a0cf6074d7211a8ba9d2b3e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20621#discussion_r168680397
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -407,6 +407,29 @@ object PartitioningUtils {
           Literal(bigDecimal)
         }
     
    +    val dateTry = Try {
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // DateType
    +      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
    --- End diff --
    
    Ah, so the root cause is more specific to `SimpleDateFormat` because it allows invalid dates like `2018-01-01-04` to be parsed fine ..


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20621#discussion_r168697699
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -407,6 +407,29 @@ object PartitioningUtils {
           Literal(bigDecimal)
         }
     
    +    val dateTry = Try {
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // DateType
    +      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
    +      // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
    +      // We need to check that we can cast the raw string since we later can use Cast to get
    +      // the partition values with the right DataType (see
    +      // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
    +      val dateOption = Option(Cast(Literal(raw), DateType).eval())
    --- End diff --
    
    sure, aren't these comments enough? may you please provide some suggestions about how you would like to improve them, ie. what is it missing/not clear? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20621#discussion_r168923873
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala ---
    @@ -1120,4 +1120,16 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha
               Row(3, BigDecimal("2" * 30)) :: Nil)
         }
       }
    +
    +  test("SPARK-23436: invalid Dates should be inferred as String in partition inference") {
    +    withTempPath { path =>
    +      val data = Seq(("1", "2018-01", "2018-01-01-04", "test"))
    +        .toDF("id", "date_month", "date_hour", "data")
    +
    +      data.write.partitionBy("date_month", "date_hour").parquet(path.getAbsolutePath)
    +      val input = spark.read.parquet(path.getAbsolutePath)
    +      checkAnswer(input.select("id", "date_month", "date_hour", "data"),
    --- End diff --
    
    I think it is not necessary, but I will add this check, thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20621#discussion_r168924005
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -407,6 +407,34 @@ object PartitioningUtils {
           Literal(bigDecimal)
         }
     
    +    val dateTry = Try {
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // DateType
    +      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
    +      // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
    +      // This can happen since DateFormat.parse  may not use the entire text of the given string:
    +      // so if there are extra-characters after the date, it returns correctly.
    +      // We need to check that we can cast the raw string since we later can use Cast to get
    +      // the partition values with the right DataType (see
    +      // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
    +      val dateValue = Cast(Literal(raw), DateType).eval()
    +      // Disallow DateType if the cast returned null
    +      require(dateValue != null)
    +      Literal.create(dateValue, DateType)
    +    }
    +
    +    val timestampTry = Try {
    +      val unescapedRaw = unescapePathName(raw)
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // TimestampType
    +      DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
    --- End diff --
    
    I don't think so, because in the cast we tolerate various timestamp format which parse doesn't support (please check the comment to `DateTimeUtils.stringToTimestamp`). So I'd not consider safe to remove this and anyway it would/may introduce unintended behavior changes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20621#discussion_r168697351
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -407,6 +407,29 @@ object PartitioningUtils {
           Literal(bigDecimal)
         }
     
    +    val dateTry = Try {
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // DateType
    +      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
    --- End diff --
    
    actually all the `DateFormat`'s `parse` allow extra-characters after a valid date: (https://docs.oracle.com/javase/7/docs/api/java/text/DateFormat.html#parse(java.lang.String)).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20621#discussion_r168925339
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -407,6 +407,34 @@ object PartitioningUtils {
           Literal(bigDecimal)
         }
     
    +    val dateTry = Try {
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // DateType
    +      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
    +      // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
    +      // This can happen since DateFormat.parse  may not use the entire text of the given string:
    +      // so if there are extra-characters after the date, it returns correctly.
    +      // We need to check that we can cast the raw string since we later can use Cast to get
    +      // the partition values with the right DataType (see
    +      // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
    +      val dateValue = Cast(Literal(raw), DateType).eval()
    +      // Disallow DateType if the cast returned null
    +      require(dateValue != null)
    +      Literal.create(dateValue, DateType)
    +    }
    +
    +    val timestampTry = Try {
    +      val unescapedRaw = unescapePathName(raw)
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // TimestampType
    +      DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
    --- End diff --
    
    Since this changes the behavior of `PartitioningUtils.parsePartitions`, doesn't it change the result of another path in `inferPartitioning`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20621#discussion_r168910779
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -407,6 +407,29 @@ object PartitioningUtils {
           Literal(bigDecimal)
         }
     
    +    val dateTry = Try {
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // DateType
    +      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
    --- End diff --
    
    is this a short-cut? It seems OK to always go to the cast path,


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20621#discussion_r168924965
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -407,6 +407,29 @@ object PartitioningUtils {
           Literal(bigDecimal)
         }
     
    +    val dateTry = Try {
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // DateType
    +      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
    --- End diff --
    
    I don't think it is enough to go always with the cast path, since it allows many format/strings, not allowed by the parse method. Thus I think it not safe to avoid the parse method.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87526/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    **[Test build #87508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87508/testReport)** for PR 20621 at commit [`6274537`](https://github.com/apache/spark/commit/6274537139b2282ac5f9ded605037f63c7bee2f9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/936/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    **[Test build #87508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87508/testReport)** for PR 20621 at commit [`6274537`](https://github.com/apache/spark/commit/6274537139b2282ac5f9ded605037f63c7bee2f9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87488/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/948/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    **[Test build #87526 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87526/testReport)** for PR 20621 at commit [`8698f4d`](https://github.com/apache/spark/commit/8698f4de7b6e1b9453a12bc949ce8666d6322a87).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    **[Test build #87485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87485/testReport)** for PR 20621 at commit [`2f05ab8`](https://github.com/apache/spark/commit/2f05ab8e82b0940e84cbe407abe49f72cddeef11).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20621
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20621


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20621#discussion_r168945339
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -407,6 +407,34 @@ object PartitioningUtils {
           Literal(bigDecimal)
         }
     
    +    val dateTry = Try {
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // DateType
    +      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
    +      // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
    +      // This can happen since DateFormat.parse  may not use the entire text of the given string:
    +      // so if there are extra-characters after the date, it returns correctly.
    +      // We need to check that we can cast the raw string since we later can use Cast to get
    +      // the partition values with the right DataType (see
    +      // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
    +      val dateValue = Cast(Literal(raw), DateType).eval()
    +      // Disallow DateType if the cast returned null
    +      require(dateValue != null)
    +      Literal.create(dateValue, DateType)
    +    }
    +
    +    val timestampTry = Try {
    +      val unescapedRaw = unescapePathName(raw)
    +      // try and parse the date, if no exception occurs this is a candidate to be resolved as
    +      // TimestampType
    +      DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
    --- End diff --
    
    Yes, you are right. the only change introduced is that some values which were previously wrongly inferred as dates, now will be inferred as strings. Everything else works as before.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org