You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2018/02/15 17:13:35 UTC
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
GitHub user mgaido91 opened a pull request:
https://github.com/apache/spark/pull/20621
[SPARK-23436][SQL] Infer partition as Date only if it can be casted to Date
## What changes were proposed in this pull request?
Before the patch, Spark could infer as Date a partition value which cannot be casted to Date (this can happen when there are extra characters after a valid date, like `2018-02-15AAA`).
When this happens and the input format has metadata which define the schema of the table, then `null` is returned as a value for the partition column, because the `cast` operator used in (`PartitioningAwareFileIndex.inferPartitioning`) is unable to convert the value.
The PR checks in the partition inference that values can be casted to Date and Timestamp, in order to infer that datatype to them.
## How was this patch tested?
added UT
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mgaido91/spark SPARK-23436
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20621.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20621
----
commit 2f05ab8e82b0940e84cbe407abe49f72cddeef11
Author: Marco Gaido <ma...@...>
Date: 2018-02-15T16:59:20Z
[SPARK-23436][SQL] Infer partition as Date only if it can be casted to Date
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/20621
@gatorsmile thanks for checking. Yes, Spark 2.2 is affected too, so I am not sure whether this should be considered a blocker regression. But, I think we should fix it as soon as possible, nonetheless.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20621
**[Test build #87526 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87526/testReport)** for PR 20621 at commit [`8698f4d`](https://github.com/apache/spark/commit/8698f4de7b6e1b9453a12bc949ce8666d6322a87).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/20621#discussion_r168935309
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,34 @@ object PartitioningUtils {
Literal(bigDecimal)
}
+ val dateTry = Try {
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // DateType
+ DateTimeUtils.getThreadLocalDateFormat.parse(raw)
+ // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
+ // This can happen since DateFormat.parse may not use the entire text of the given string:
+ // so if there are extra-characters after the date, it returns correctly.
+ // We need to check that we can cast the raw string since we later can use Cast to get
+ // the partition values with the right DataType (see
+ // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
+ val dateValue = Cast(Literal(raw), DateType).eval()
+ // Disallow DateType if the cast returned null
+ require(dateValue != null)
+ Literal.create(dateValue, DateType)
+ }
+
+ val timestampTry = Try {
+ val unescapedRaw = unescapePathName(raw)
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // TimestampType
+ DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
--- End diff --
`inferPartitioning` will use `PartitioningUtils.parsePartitions` to infer the partition spec if there is no `userPartitionSchema`. It is used by `DataSource.sourceSchema`. Seems this change makes the partition directory previously parsing-able now unable to parse. Will it change behavior of other code path?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/20621#discussion_r168919128
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,34 @@ object PartitioningUtils {
Literal(bigDecimal)
}
+ val dateTry = Try {
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // DateType
+ DateTimeUtils.getThreadLocalDateFormat.parse(raw)
+ // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
+ // This can happen since DateFormat.parse may not use the entire text of the given string:
+ // so if there are extra-characters after the date, it returns correctly.
+ // We need to check that we can cast the raw string since we later can use Cast to get
+ // the partition values with the right DataType (see
+ // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
+ val dateValue = Cast(Literal(raw), DateType).eval()
+ // Disallow DateType if the cast returned null
+ require(dateValue != null)
+ Literal.create(dateValue, DateType)
+ }
+
+ val timestampTry = Try {
+ val unescapedRaw = unescapePathName(raw)
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // TimestampType
+ DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
--- End diff --
Can we save the parsing here? I think to cast string to `TimestampType` will also check if it can be parsed as timestamp?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87485/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/921/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/20621#discussion_r168910831
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala ---
@@ -1120,4 +1120,16 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha
Row(3, BigDecimal("2" * 30)) :: Nil)
}
}
+
+ test("SPARK-23436: invalid Dates should be inferred as String in partition inference") {
+ withTempPath { path =>
+ val data = Seq(("1", "2018-01", "2018-01-01-04", "test"))
+ .toDF("id", "date_month", "date_hour", "data")
+
+ data.write.partitionBy("date_month", "date_hour").parquet(path.getAbsolutePath)
+ val input = spark.read.parquet(path.getAbsolutePath)
+ checkAnswer(input.select("id", "date_month", "date_hour", "data"),
--- End diff --
shall we also check the schema to make sure that field is string type?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/923/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/20621
it's not a very serious bug, I'd like to hold it until 2.3 is released. We may have it in 2.3.1
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/20621
thanks @cloud-fan. Sorry, since this seems a bug to me, why this wasn't backported to branch-2.3 too? Thanks.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/922/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20621
**[Test build #87488 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87488/testReport)** for PR 20621 at commit [`6b56408`](https://github.com/apache/spark/commit/6b5640833a2d45986a0cf6074d7211a8ba9d2b3e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20621#discussion_r168700662
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,29 @@ object PartitioningUtils {
Literal(bigDecimal)
}
+ val dateTry = Try {
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // DateType
+ DateTimeUtils.getThreadLocalDateFormat.parse(raw)
+ // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
+ // We need to check that we can cast the raw string since we later can use Cast to get
+ // the partition values with the right DataType (see
+ // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
+ val dateOption = Option(Cast(Literal(raw), DateType).eval())
--- End diff --
I mean .. simply like:
```
// Disallow date type if the cast returned null blah blah
require(dateOption.isDefine)
```
nothing special. I am fine with not adding it too.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87508/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20621#discussion_r168925882
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,34 @@ object PartitioningUtils {
Literal(bigDecimal)
}
+ val dateTry = Try {
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // DateType
+ DateTimeUtils.getThreadLocalDateFormat.parse(raw)
+ // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
+ // This can happen since DateFormat.parse may not use the entire text of the given string:
+ // so if there are extra-characters after the date, it returns correctly.
+ // We need to check that we can cast the raw string since we later can use Cast to get
+ // the partition values with the right DataType (see
+ // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
+ val dateValue = Cast(Literal(raw), DateType).eval()
+ // Disallow DateType if the cast returned null
+ require(dateValue != null)
+ Literal.create(dateValue, DateType)
+ }
+
+ val timestampTry = Try {
+ val unescapedRaw = unescapePathName(raw)
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // TimestampType
+ DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
--- End diff --
sorry, I am not sure I got 100% your question, may you elaborate it a bit more please?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20621
**[Test build #87485 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87485/testReport)** for PR 20621 at commit [`2f05ab8`](https://github.com/apache/spark/commit/2f05ab8e82b0940e84cbe407abe49f72cddeef11).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/20621
cc @cloud-fan @HyukjinKwon @viirya
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/20621
It sounds like Spark 2.2 already has this bug. This causes an incorrect result.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20621#discussion_r168680871
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,29 @@ object PartitioningUtils {
Literal(bigDecimal)
}
+ val dateTry = Try {
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // DateType
+ DateTimeUtils.getThreadLocalDateFormat.parse(raw)
+ // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
+ // We need to check that we can cast the raw string since we later can use Cast to get
+ // the partition values with the right DataType (see
+ // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
+ val dateOption = Option(Cast(Literal(raw), DateType).eval())
--- End diff --
Can we add `require(dateOption.isDefine)` with some comments explicitly?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/20621
This is a blocker-level regression.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20621
**[Test build #87488 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87488/testReport)** for PR 20621 at commit [`6b56408`](https://github.com/apache/spark/commit/6b5640833a2d45986a0cf6074d7211a8ba9d2b3e).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20621#discussion_r168680397
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,29 @@ object PartitioningUtils {
Literal(bigDecimal)
}
+ val dateTry = Try {
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // DateType
+ DateTimeUtils.getThreadLocalDateFormat.parse(raw)
--- End diff --
Ah, so the root cause is more specific to `SimpleDateFormat` because it allows invalid dates like `2018-01-01-04` to be parsed fine ..
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20621#discussion_r168697699
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,29 @@ object PartitioningUtils {
Literal(bigDecimal)
}
+ val dateTry = Try {
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // DateType
+ DateTimeUtils.getThreadLocalDateFormat.parse(raw)
+ // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
+ // We need to check that we can cast the raw string since we later can use Cast to get
+ // the partition values with the right DataType (see
+ // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
+ val dateOption = Option(Cast(Literal(raw), DateType).eval())
--- End diff --
sure, aren't these comments enough? may you please provide some suggestions about how you would like to improve them, ie. what is it missing/not clear? Thanks.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20621#discussion_r168923873
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala ---
@@ -1120,4 +1120,16 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha
Row(3, BigDecimal("2" * 30)) :: Nil)
}
}
+
+ test("SPARK-23436: invalid Dates should be inferred as String in partition inference") {
+ withTempPath { path =>
+ val data = Seq(("1", "2018-01", "2018-01-01-04", "test"))
+ .toDF("id", "date_month", "date_hour", "data")
+
+ data.write.partitionBy("date_month", "date_hour").parquet(path.getAbsolutePath)
+ val input = spark.read.parquet(path.getAbsolutePath)
+ checkAnswer(input.select("id", "date_month", "date_hour", "data"),
--- End diff --
I think it is not necessary, but I will add this check, thanks.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20621#discussion_r168924005
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,34 @@ object PartitioningUtils {
Literal(bigDecimal)
}
+ val dateTry = Try {
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // DateType
+ DateTimeUtils.getThreadLocalDateFormat.parse(raw)
+ // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
+ // This can happen since DateFormat.parse may not use the entire text of the given string:
+ // so if there are extra-characters after the date, it returns correctly.
+ // We need to check that we can cast the raw string since we later can use Cast to get
+ // the partition values with the right DataType (see
+ // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
+ val dateValue = Cast(Literal(raw), DateType).eval()
+ // Disallow DateType if the cast returned null
+ require(dateValue != null)
+ Literal.create(dateValue, DateType)
+ }
+
+ val timestampTry = Try {
+ val unescapedRaw = unescapePathName(raw)
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // TimestampType
+ DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
--- End diff --
I don't think so, because in the cast we tolerate various timestamp format which parse doesn't support (please check the comment to `DateTimeUtils.stringToTimestamp`). So I'd not consider safe to remove this and anyway it would/may introduce unintended behavior changes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20621#discussion_r168697351
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,29 @@ object PartitioningUtils {
Literal(bigDecimal)
}
+ val dateTry = Try {
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // DateType
+ DateTimeUtils.getThreadLocalDateFormat.parse(raw)
--- End diff --
actually all the `DateFormat`'s `parse` allow extra-characters after a valid date: (https://docs.oracle.com/javase/7/docs/api/java/text/DateFormat.html#parse(java.lang.String)).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/20621#discussion_r168925339
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,34 @@ object PartitioningUtils {
Literal(bigDecimal)
}
+ val dateTry = Try {
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // DateType
+ DateTimeUtils.getThreadLocalDateFormat.parse(raw)
+ // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
+ // This can happen since DateFormat.parse may not use the entire text of the given string:
+ // so if there are extra-characters after the date, it returns correctly.
+ // We need to check that we can cast the raw string since we later can use Cast to get
+ // the partition values with the right DataType (see
+ // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
+ val dateValue = Cast(Literal(raw), DateType).eval()
+ // Disallow DateType if the cast returned null
+ require(dateValue != null)
+ Literal.create(dateValue, DateType)
+ }
+
+ val timestampTry = Try {
+ val unescapedRaw = unescapePathName(raw)
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // TimestampType
+ DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
--- End diff --
Since this changes the behavior of `PartitioningUtils.parsePartitions`, doesn't it change the result of another path in `inferPartitioning`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/20621#discussion_r168910779
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,29 @@ object PartitioningUtils {
Literal(bigDecimal)
}
+ val dateTry = Try {
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // DateType
+ DateTimeUtils.getThreadLocalDateFormat.parse(raw)
--- End diff --
is this a short-cut? It seems OK to always go to the cast path,
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20621#discussion_r168924965
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,29 @@ object PartitioningUtils {
Literal(bigDecimal)
}
+ val dateTry = Try {
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // DateType
+ DateTimeUtils.getThreadLocalDateFormat.parse(raw)
--- End diff --
I don't think it is enough to go always with the cast path, since it allows many format/strings, not allowed by the parse method. Thus I think it not safe to avoid the parse method.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87526/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20621
**[Test build #87508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87508/testReport)** for PR 20621 at commit [`6274537`](https://github.com/apache/spark/commit/6274537139b2282ac5f9ded605037f63c7bee2f9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/936/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/20621
thanks, merging to master!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20621
**[Test build #87508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87508/testReport)** for PR 20621 at commit [`6274537`](https://github.com/apache/spark/commit/6274537139b2282ac5f9ded605037f63c7bee2f9).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87488/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/948/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20621
**[Test build #87526 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87526/testReport)** for PR 20621 at commit [`8698f4d`](https://github.com/apache/spark/commit/8698f4de7b6e1b9453a12bc949ce8666d6322a87).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20621
**[Test build #87485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87485/testReport)** for PR 20621 at commit [`2f05ab8`](https://github.com/apache/spark/commit/2f05ab8e82b0940e84cbe407abe49f72cddeef11).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20621
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/20621
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20621#discussion_r168945339
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,34 @@ object PartitioningUtils {
Literal(bigDecimal)
}
+ val dateTry = Try {
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // DateType
+ DateTimeUtils.getThreadLocalDateFormat.parse(raw)
+ // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
+ // This can happen since DateFormat.parse may not use the entire text of the given string:
+ // so if there are extra-characters after the date, it returns correctly.
+ // We need to check that we can cast the raw string since we later can use Cast to get
+ // the partition values with the right DataType (see
+ // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
+ val dateValue = Cast(Literal(raw), DateType).eval()
+ // Disallow DateType if the cast returned null
+ require(dateValue != null)
+ Literal.create(dateValue, DateType)
+ }
+
+ val timestampTry = Try {
+ val unescapedRaw = unescapePathName(raw)
+ // try and parse the date, if no exception occurs this is a candidate to be resolved as
+ // TimestampType
+ DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
--- End diff --
Yes, you are right. the only change introduced is that some values which were previously wrongly inferred as dates, now will be inferred as strings. Everything else works as before.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org