Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/19 17:24:22 UTC

[GitHub] [spark] xiaonanyang-db opened a new pull request, #37933: SPARK-40474 Infer columns with mixed date and timestamp as String in CSV schema inference

xiaonanyang-db opened a new pull request, #37933:
URL: https://github.com/apache/spark/pull/37933

   
   ### What changes were proposed in this pull request?
   Adjust part of the changes in https://github.com/apache/spark/pull/36871.
   In the PR above, we introduced support for the date type in CSV schema inference. The schema inference behavior on datetime columns is now:
   - For a column only containing dates, we will infer it as Date type
   - For a column only containing timestamps, we will infer it as Timestamp type
   - For a column containing a mixture of dates and timestamps, we will infer it as Timestamp type
   
   However, we found that we were too ambitious in the last scenario: supporting it introduced significant code complexity and raised a number of performance concerns. Thus, we want to simplify the behavior of the last scenario as follows (see the illustrative snippet after the list):
   
   - For a column containing a mixture of dates and timestamps, we will infer it as String type
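
   A minimal, hypothetical example of the proposed behavior (the file path, column name, and values below are illustrative, not part of this PR), assuming a CSV column that mixes date-only values with full timestamps:

   ```scala
   // Hypothetical input at /tmp/mixed.csv:
   //   c
   //   2022-01-01
   //   2022-01-01 12:00:00
   val df = spark.read
     .option("header", "true")
     .option("inferSchema", "true")
     .option("prefersDate", "true")  // date inference is gated by this option
     .csv("/tmp/mixed.csv")
   df.printSchema()
   // Under the simplification proposed here, the mixed column is expected to be:
   //  |-- c: string (nullable = true)
   ```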
   
   ### Why are the changes needed?
   Simplify the CSV datetime inference logic.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No user-facing change compared to the previous PR.
   
   
   ### How was this patch tested?
   




[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r977079883


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing a mixture of dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as timestamp type
+        // otherwise infer it as StringType

Review Comment:
   We want to have consistent behavior when the timestamp format is not given.
   
   When `prefersDate=false`, a column with mixed dates and timestamps could be inferred as timestamp where possible.
   Thus, we added the additional handling here to get similar behavior when `prefersDate=true`.
   
   Does this make sense?





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976081324


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   ```
   dateFormat = "yyyy/MM/dd"
   timestampFormat = "yyyy/MM/dd HH:mm:ss" 
   ```
   Because `timestampFormat` is specified, we will use the strict parser.
   In this case, a column with a mix of dates and timestamps will be inferred as `StringType`.
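
   As a hedged end-to-end illustration of that outcome (the file path and data are made up, not from this PR):

   ```scala
   // Hypothetical /tmp/mixed.csv:
   //   c
   //   2022/01/01
   //   2022/01/01 12:00:00
   val df = spark.read
     .option("header", "true")
     .option("inferSchema", "true")
     .option("prefersDate", "true")
     .option("dateFormat", "yyyy/MM/dd")
     .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")  // explicit format => strict parser
     .csv("/tmp/mixed.csv")
   df.printSchema()
   // Expected: the mixed column falls back to StringType
   ```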





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976077440


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   We need this set to determine whether a column with a mixture of dates and timestamps should be inferred as `TimestampType` or `StringType` when no timestamp format is specified (in which case the lenient timestamp formatter is used).
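
   For intuition (this is not code from the PR, just spark-shell usage): the lenient default formatter shares its parsing logic with CAST, so a bare date string can still be parsed as a timestamp, which is why the set above matters:

   ```scala
   // Runs in spark-shell; the value is rendered in the session time zone.
   spark.sql("SELECT CAST('2022-01-01' AS TIMESTAMP) AS ts").show()
   // +-------------------+
   // |                 ts|
   // +-------------------+
   // |2022-01-01 00:00:00|
   // +-------------------+
   ```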





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After a quick discussion within my team, we agreed not to change these lines, to avoid unnecessary regressions and other behavior changes. Thus, the behavior after this PR becomes:
   - If the user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamps according to that format. Thus, columns mixing dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` is specified by the user, then for a column mixing dates and timestamps:
     - If date values come before timestamp values
       - If `prefersDate=true`, the column will be inferred as `StringType`
       - otherwise
         - If the date format is supported by `Iso8601TimestampFormatter`, the column will be inferred as `TimestampType/TimestampNTZType`
         - otherwise, the column will be inferred as `StringType`
     - If timestamp values come before date values
       - If the date format is supported by `Iso8601TimestampFormatter`, the column will be inferred as `TimestampType/TimestampNTZType`
       - otherwise the column will be inferred as `StringType`
   
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After a quick discussion within my team, we agreed not to change these lines, to avoid unnecessary regressions and other behavior changes. Thus, the behavior after this PR becomes:
   - If the user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamps according to that format. Thus, columns mixing dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` is specified by the user, then for a column mixing dates and timestamps:
     - If date values come before timestamp values
       - If `prefersDate=true`, the column will be inferred as `StringType`
       - otherwise
         - If the date format is supported by `DefaultTimestampFormatter`, the column will be inferred as `TimestampType/TimestampNTZType`
         - otherwise, the column will be inferred as `StringType`
     - If timestamp values come before date values
       - If the date format is supported by `DefaultTimestampFormatter`, the column will be inferred as `TimestampType/TimestampNTZType`
       - otherwise the column will be inferred as `StringType`
   
   There is no behavior change when `prefersDate=false`.
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After a quick discussion within my team, we agreed not to change these lines, to avoid unnecessary regressions and other behavior changes. Thus, the behavior after this PR becomes:
   - If the user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamps according to that format. Thus, columns mixing dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` is specified by the user, then for a column mixing dates and timestamps:
     - If date values come before timestamp values
       - If `prefersDate=true`, the column will be inferred as `StringType`
       - otherwise the column could be inferred as timestamp or string type, depending on whether the date format is supported by the lenient timestamp formatter
     - If timestamp values come before date values
       - the column could be inferred as timestamp or string type, depending on whether the date format is supported by the lenient timestamp formatter
       
   There is no behavior change when `prefersDate=false`.
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r977144993


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing a mixture of dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as timestamp type
+        // otherwise infer it as StringType
+        val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern)
+        t1 match {
+          case DateType if canParseDateAsTimestamp(dateFormat, t2) =>
+            Some(t2)
+          case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) =>
+            Some(t1)
+          case _ => Some(StringType)
+        }
+      case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    }
+  }
+
+  /**
+   * Return true if strings of given date format can be parsed as timestamps
+   *  1. If user provides timestamp format, we will parse strings as timestamps using
+   *  Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed
+   *  as timestamp type in this case
+   *  2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which
+   *  is more lenient and can parse strings of some date formats as timestamps.
+   */
+  private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = {
+    if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) ||
+      (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) {
+      LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat)

Review Comment:
   There is no easy way to achieve that behavior either, because the timestamp parser





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976081324


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   ```
   dateFormat = "yyyy/MM/dd"
   timestampFormat = "yyyy/MM/dd HH:mm:ss" 
   ```
   I don't quite understand your question on this case. 
   But speaking in the context of this PR: because `timestampFormat` is specified, we will use the strict parser, so a column with a mix of dates and timestamps will be inferred as `StringType`.





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974909141


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -224,7 +223,7 @@ class UnivocityParser(
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility if enabled.
-            if (!enableParsingFallbackForTimestampType) {
+            if (!enableParsingFallbackForDateType) {

Review Comment:
   That will be nice!





[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974985351


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -224,7 +223,7 @@ class UnivocityParser(
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility if enabled.
-            if (!enableParsingFallbackForTimestampType) {
+            if (!enableParsingFallbackForDateType) {

Review Comment:
   Opened https://github.com/apache/spark/pull/37942 to fix this. Apologies for the mistake.





[GitHub] [spark] cloud-fan closed pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
URL: https://github.com/apache/spark/pull/37933




[GitHub] [spark] brkyvz commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
brkyvz commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r978224414


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -59,6 +59,13 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  // Used to determine inferring a column with mixture of dates and timestamps as TimestampType or
+  // StringType when no timestamp format is specified (the lenient timestamp formatter will be used)
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(
+    "yyyy-MM-dd", "yyyy-M-d", "yyyy-M-dd", "yyyy-MM-d", "yyyy-MM", "yyyy-M", "yyyy")

Review Comment:
   These are variants of the lenient timestamp formatter's supported date formats. It's not a hardcoded list of what the user can use.
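
   A hedged illustration of how such a pre-check could look (the variable names are made up; the set contents come from the diff above):

   ```scala
   // The set mirrors date patterns the lenient (default) timestamp formatter already
   // understands, so membership is a cheap way to decide TimestampType vs StringType.
   val lenientSupportedDateFormats = Set(
     "yyyy-MM-dd", "yyyy-M-d", "yyyy-M-dd", "yyyy-MM-d", "yyyy-MM", "yyyy-M", "yyyy")
   val dateFormat = "yyyy-MM-dd"  // example value; in the PR this defaults to DateFormatter.defaultPattern
   val canParseDateAsTimestamp = lenientSupportedDateFormats.contains(dateFormat)
   ```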





[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976089392


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {

Review Comment:
   What result does `findTightestCommonType` return for DateType and TimestampType?





[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r977085297


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing a mixture of dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as timestamp type
+        // otherwise infer it as StringType

Review Comment:
   nvm, I got it





[GitHub] [spark] cloud-fan commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37933:
URL: https://github.com/apache/spark/pull/37933#issuecomment-1253301260

   There are many cases to consider here: 1) whether the CSV data is pure date, pure timestamp, or a mixture; 2) whether the user specifies a datetime pattern or not. (A hedged snippet of the no-pattern cases follows the list.)
   
   1. pure date + no datetime pattern: infer as date type
   2. pure timestamp + no datetime pattern: infer as timestamp type
   3. mixture + no datetime pattern: infer as timestamp type
   4. pure date + datetime pattern: if pattern matches, infer as date type, otherwise string type
   5. pure timestamp + datetime pattern: if pattern matches, infer as timestamp type, otherwise string type
   6. mixture + datetime pattern: I think this is where the problem occurs. We first try to parse the data as a date and, if that fails, try to parse it as a timestamp. This is very slow as we invoke the formatter twice. I think we shouldn't support a mixture of dates and timestamps in this case. If `prefersDate` is true, only try to infer as date type; otherwise only try to infer as timestamp.
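
   A hedged reading of cases 1-3 (no datetime pattern) in DataFrame API terms; the paths and data are made up for illustration:

   ```scala
   // spark.read returns a fresh DataFrameReader each call, so each read is independent.
   def inferred(path: String) = spark.read
     .option("header", "true")
     .option("inferSchema", "true")
     .option("prefersDate", "true")
     .csv(path)
   inferred("/tmp/pure_dates.csv").printSchema()       // case 1: expected date type
   inferred("/tmp/pure_timestamps.csv").printSchema()  // case 2: expected timestamp type
   inferred("/tmp/mixed.csv").printSchema()            // case 3: expected timestamp type
   ```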




[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974905472


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -224,7 +223,7 @@ class UnivocityParser(
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility if enabled.
-            if (!enableParsingFallbackForTimestampType) {
+            if (!enableParsingFallbackForDateType) {

Review Comment:
   Thanks for fixing it, the values were swapped.





[GitHub] [spark] xiaonanyang-db commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on PR #37933:
URL: https://github.com/apache/spark/pull/37933#issuecomment-1251869330

   > Can you update the description to list all of the semantics of the change? You can remove the point where we need to merge them to TimestampType if this is not what the PR implements and replace it with "merging to StringType" instead.
   > 
   > Is it correct that the date inference is still controlled by "prefersDate"?
   
   Sure!
   Yes, it's still controlled by "prefersDate".




[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974908544


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -224,7 +223,7 @@ class UnivocityParser(
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility if enabled.
-            if (!enableParsingFallbackForTimestampType) {
+            if (!enableParsingFallbackForDateType) {

Review Comment:
   @xiaonanyang-db Does it mean that we don't have test coverage for it? I am wondering how I managed to merge like this. I will open a follow-up PR to improve testing.





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976081324


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   ```
   dateFormat = "yyyy/MM/dd"
   timestampFormat = "yyyy/MM/dd HH:mm:ss" 
   ```
   I don't quite understand your question on this case. 
   But speaking in the context of this PR, because `timestampFormat` is specified, a column with a mix of dates and timestamps will be inferred as `StringType`.





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976070903


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {

Review Comment:
   `findTightestCommonType` merges DateType and TimestampType in a way that is not applicable here





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976042644


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Made some follow-up changes; please check the updated description for the post-change behavior and semantics.





[GitHub] [spark] cloud-fan commented on pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37933:
URL: https://github.com/apache/spark/pull/37933#issuecomment-1255000616

   Can we update the PR description to mention that `prefersDate` is true by default now?




[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r977086043


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing a mixture of dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as timestamp type
+        // otherwise infer it as StringType
+        val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern)
+        t1 match {
+          case DateType if canParseDateAsTimestamp(dateFormat, t2) =>
+            Some(t2)
+          case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) =>
+            Some(t1)
+          case _ => Some(StringType)
+        }
+      case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    }
+  }
+
+  /**
+   * Return true if strings of given date format can be parsed as timestamps
+   *  1. If user provides timestamp format, we will parse strings as timestamps using
+   *  Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed
+   *  as timestamp type in this case
+   *  2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which
+   *  is more lenient and can parse strings of some date formats as timestamps.
+   */
+  private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = {
+    if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) ||
+      (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) {
+      LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat)

Review Comment:
   Do we really need to cover these corner cases? We can just say that we can only parse a date as a timestamp if neither the timestamp pattern nor the date pattern is specified.





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974986988


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   - There are two types of timestamp formatter: Default and Iso8601.
     - The default formatter uses the same parsing logic as CAST, which can parse a date value as a timestamp.
     - The Iso8601 formatter has stricter parsing.
     - If no timestampFormat is given, we will use the default one.
   
   - In the case of a column containing a timestamp value first with dates following, if we used the default formatter it would still parse the dates as timestamps and infer the column as timestamp type, which is inconsistent with what we are proposing now.
   
   It’s pretty similar to the legacy parser problem. A hedged sketch of this contrast follows.
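
   Sketch of the contrast (paths and values are illustrative only): the same timestamp-then-date column is read once without and once with an explicit `timestampFormat`:

   ```scala
   // Hypothetical /tmp/ts_then_date.csv:
   //   c
   //   2011-01-01 00:00:00
   //   2011-01-02
   // Lenient default formatter (no timestampFormat): the date row can still be
   // parsed as a timestamp, so the column may be inferred as timestamp type.
   spark.read
     .option("header", "true")
     .option("inferSchema", "true")
     .csv("/tmp/ts_then_date.csv")
     .printSchema()
   // Strict Iso8601 formatter (explicit timestampFormat): the date row fails to
   // parse, so the column is expected to fall back to StringType.
   spark.read
     .option("header", "true")
     .option("inferSchema", "true")
     .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
     .csv("/tmp/ts_then_date.csv")
     .printSchema()
   ```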
   





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan & @sadikovi.
   
   After a quick discussion within my team, we agreed not to change these lines, to avoid unnecessary regressions and other behavior changes. Thus, the behavior after this PR becomes:
   - If the user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamps according to that format. Thus, columns mixing dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` is specified by the user, we keep the current behavior that columns mixing dates and timestamps can be inferred as `TimestampType/TimestampNTZType`, which is okay and even a useful feature.
   
   Does this make sense to you? @sadikovi @cloud-fan 





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After a quick discussion within my team, we agreed not to change these lines, to avoid unnecessary regressions and other behavior changes. Thus, the behavior after this PR becomes:
   - If the user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamps according to that format. Thus, columns mixing dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` is specified by the user, then for a column mixing dates and timestamps:
     - If date values come before timestamp values, the column will be inferred as `TimestampType/TimestampNTZType`
     - If timestamp values come before date values
       - If the date format is supported by `Iso8601TimestampFormatter`, the column will still be inferred as `TimestampType/TimestampNTZType`
       - otherwise the column will be inferred as `StringType`
   
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r977079883


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing a mixture of dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as timestamp type
+        // otherwise infer it as StringType

Review Comment:
   We want to have consistent behavior when the timestamp format is not specified.
   
   When `prefersDate=false`, a column with mixed dates and timestamps could be inferred as timestamp where possible.
   Thus, we added the additional handling here to get similar behavior when `prefersDate=true`.
   
   Does this make sense?





[GitHub] [spark] AmplabJenkins commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37933:
URL: https://github.com/apache/spark/pull/37933#issuecomment-1253761727

   Can one of the admins verify this patch?




[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After a quick discussion within my team, we agreed not to change these lines, to avoid unnecessary regressions and other behavior changes. Thus, the behavior after this PR becomes:
   - If the user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamps according to that format. Thus, columns mixing dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` is specified by the user, then for a column mixing dates and timestamps:
     - If date values come before timestamp values
       - If `prefersDate=true`, the column will be inferred as `StringType`
       - otherwise
         - If the date format is supported by `Iso8601TimestampFormatter`, the column will be inferred as `TimestampType/TimestampNTZType`
         - otherwise, the column will be inferred as `StringType`
     - If timestamp values come before date values
       - If the date format is supported by `Iso8601TimestampFormatter`, the column will be inferred as `TimestampType/TimestampNTZType`
       - otherwise the column will be inferred as `StringType`
   
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 





[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976046176


##########
docs/sql-data-sources-csv.md:
##########
@@ -111,7 +111,7 @@ Data source options of CSV can be set via:
   <tr>
     <td><code>prefersDate</code></td>
     <td>false</td>
-    <td>During schema inference (<code>inferSchema</code>), attempts to infer string columns that contain dates or timestamps as <code>Date</code> if the values satisfy the <code>dateFormat</code> option and failed to be parsed by the respective formatter. With a user-provided schema, attempts to parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, in this case the parsed values will be cast to timestamp type afterwards.</td>
+    <td>During schema inference (<code>inferSchema</code>), attempts to infer string columns that contain dates as <code>Date</code> if the values satisfy the <code>dateFormat</code> option or default date format. For columns that contain mixing dates and timestamps, infer them as <code>StringType</code>.</td>

Review Comment:
   nit: `... a mix/mixture of dates and timestamps ...`?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing mixing dates and timestamps

Review Comment:
   Same here, just a bit of rewording.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing mixing dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as timestamp type
+        // otherwise infer it as StringType
+        val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern)
+        t1 match {
+          case DateType if canParseDateAsTimestamp(dateFormat, t2) =>
+            Some(t2)
+          case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) =>
+            Some(t1)
+          case _ => Some(StringType)
+        }
+      case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    }
+  }
+
+  /**
+   * Return if strings of given date format can be parsed as timestamps

Review Comment:
   nit: Returns `true` if strings ...



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -2819,51 +2819,68 @@ abstract class CSVSuite
     }
   }
 
-  test("SPARK-39469: Infer schema for date type") {
-    val options1 = Map(
-      "header" -> "true",
-      "inferSchema" -> "true",
-      "timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss",
-      "dateFormat" -> "yyyy-MM-dd",
-      "prefersDate" -> "true")
-    val options2 = Map(
-      "header" -> "true",
-      "inferSchema" -> "true",
-      "prefersDate" -> "true")
-
-    // Error should be thrown when attempting to prefersDate with Legacy parser
-    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) {
-      checkError(
-        exception = intercept[SparkIllegalArgumentException] {
-          spark.read.format("csv").options(options1).load(testFile(dateInferSchemaFile))
-        },
-        errorClass = "CANNOT_INFER_DATE")
-    } else {
-      // 1. Specify date format and timestamp format
-      // 2. Date inference should work with default date format when dateFormat is not provided
-      Seq(options1, options2).foreach {options =>
+  test("SPARK-39469: Infer schema for columns with only dates " +
+    "and columns with mixing date and timestamps correctly") {
+    def checkCSVReadDatetime(
+      options: Map[String, String],
+      expectedSchema: StructType,
+      expectedData: Seq[Seq[Any]]): Unit = {
+
+      // Error should be thrown when attempting to prefersDate with Legacy parser

Review Comment:
   nit: `to use prefersDate ...`.



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -2819,51 +2819,68 @@ abstract class CSVSuite
     }
   }
 
-  test("SPARK-39469: Infer schema for date type") {
-    val options1 = Map(
-      "header" -> "true",
-      "inferSchema" -> "true",
-      "timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss",
-      "dateFormat" -> "yyyy-MM-dd",
-      "prefersDate" -> "true")
-    val options2 = Map(
-      "header" -> "true",
-      "inferSchema" -> "true",
-      "prefersDate" -> "true")
-
-    // Error should be thrown when attempting to prefersDate with Legacy parser
-    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) {
-      checkError(
-        exception = intercept[SparkIllegalArgumentException] {
-          spark.read.format("csv").options(options1).load(testFile(dateInferSchemaFile))
-        },
-        errorClass = "CANNOT_INFER_DATE")
-    } else {
-      // 1. Specify date format and timestamp format
-      // 2. Date inference should work with default date format when dateFormat is not provided
-      Seq(options1, options2).foreach {options =>
+  test("SPARK-39469: Infer schema for columns with only dates " +
+    "and columns with mixing date and timestamps correctly") {
+    def checkCSVReadDatetime(
+      options: Map[String, String],

Review Comment:
   nit: 4 spaces.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing mixing dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as timestamp type
+        // otherwise infer it as StringType
+        val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern)
+        t1 match {
+          case DateType if canParseDateAsTimestamp(dateFormat, t2) =>
+            Some(t2)
+          case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) =>
+            Some(t1)
+          case _ => Some(StringType)
+        }
+      case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    }
+  }
+
+  /**
+   * Return if strings of given date format can be parsed as timestamps
+   *  1. If user provides timestamp format, we will parse strings as timestamps using
+   *  Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed
+   *  as timestamp type in this case
+   *  2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which
+   *  is more lenient and can parse strings of some date formats as timestamps.]

Review Comment:
   nit: remove `]` at the end.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {

Review Comment:
   Should this match be in `findCompatibleTypeForCSV`? Or does `findTightestCommonType` merge DateType and TimestampType in a way that is not applicable here?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -224,7 +223,7 @@ class UnivocityParser(
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility if enabled.
-            if (!enableParsingFallbackForTimestampType) {
+            if (!enableParsingFallbackForDateType) {

Review Comment:
   This has been fixed, could you rebase? 



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   Sorry, I don't quite get this part; would you be able to elaborate on why we need to keep a set of such formats?
   
   More of a thought experiment (I don't think this actually happens in practice): cast allows years longer than 4 digits; is that something that needs to be supported here? For more context, see https://issues.apache.org/jira/browse/SPARK-39731.
   
   Also, will the following work? My understanding is that it will not; only yyyy-MM-dd would.
   ```
   dateFormat = "yyyy/MM/dd"
   timestampFormat = "yyyy/MM/dd HH:mm:ss" 
   ```
   





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976070903


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {

Review Comment:
   `findTightestCommonType` merges DateType and TimestampType in a way that is not applicable here?





[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976091402


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {

Review Comment:
   Never mind, I checked the code, the resulting type will be TimestampType or TimestampNTZ.





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976108061


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   When the timestamp format is not specified, the desired behavior is that a column with a mix of dates and timestamps is inferred as timestamp type if the lenient timestamp formatter can parse the date strings in that column as well. To achieve that without introducing other performance concerns, we simply check whether the dateFormat is supported by the lenient timestamp formatter. Does that make sense?
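   
   As a rough illustration of that check (a sketch only, not the actual Spark code; the helper name is made up and the format set is copied from the diff shown elsewhere in this thread):
   ```scala
   // Sketch only: decide how a Date/Timestamp mixture is merged during inference.
   val lenientSupportedDateFormats =
     Set("yyyy-MM-dd", "yyyy-M-d", "yyyy-M-dd", "yyyy-MM-d", "yyyy-MM", "yyyy-M", "yyyy")
   
   def mergedTypeForMixedColumn(
       dateFormat: String,
       userProvidedTimestampFormat: Boolean): String = {
     if (!userProvidedTimestampFormat && lenientSupportedDateFormats.contains(dateFormat)) {
       // The lenient default timestamp parser also accepts these date patterns,
       // so the whole column can stay a timestamp type.
       "TimestampType"
     } else {
       // A strict, user-provided timestamp format would reject plain dates,
       // so the column falls back to StringType.
       "StringType"
     }
   }
   ```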





[GitHub] [spark] xiaonanyang-db commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on PR #37933:
URL: https://github.com/apache/spark/pull/37933#issuecomment-1253325002

   > I think the logic in this PR seems reasonable. If prefersDate = true and we have date and timestamp strings, make the column StringType. If prefersDate was not set, then this could be inferred as timestamp if possible.
   > 
   > I am just not sure about hardcoding date formats in the CSVInferSchema parser.
   
   Thanks @sadikovi. I understand your concern; it's a trade-off against introducing performance degradation.




[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r977144993


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing a mixture of dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as timestamp type
+        // otherwise infer it as StringType
+        val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern)
+        t1 match {
+          case DateType if canParseDateAsTimestamp(dateFormat, t2) =>
+            Some(t2)
+          case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) =>
+            Some(t1)
+          case _ => Some(StringType)
+        }
+      case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    }
+  }
+
+  /**
+   * Return true if strings of given date format can be parsed as timestamps
+   *  1. If user provides timestamp format, we will parse strings as timestamps using
+   *  Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed
+   *  as timestamp type in this case
+   *  2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which
+   *  is more lenient and can parse strings of some date formats as timestamps.
+   */
+  private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = {
+    if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) ||
+      (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) {
+      LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat)

Review Comment:
   This is a behavior change relative to the Spark 3.3 branch, where a column with mixed dates and timestamps could be inferred as timestamp type, if possible, when no timestamp pattern is specified.





[GitHub] [spark] cloud-fan commented on pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37933:
URL: https://github.com/apache/spark/pull/37933#issuecomment-1255902174

   thanks, merging to master!




[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976083738


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   > More of a thought experiment, I don't think this actually happens in practice: cast allows years longer than 4 digits, is it something that needs to be supported here? For more context, https://issues.apache.org/jira/browse/SPARK-39731.
   
   Those are some interesting formats; I am not sure we need to take care of them here.





[GitHub] [spark] sadikovi commented on pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37933:
URL: https://github.com/apache/spark/pull/37933#issuecomment-1254315080

   @cloud-fan @HyukjinKwon Do you have any concerns or questions? IMHO, we can merge this PR; it seems that all of the questions have been addressed. Thanks.




[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976090455


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {

Review Comment:
   ```
   (d1, d2) match {
         case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
           TimestampType
   
         case (_: TimestampType, _: TimestampNTZType) | (_: TimestampNTZType, _: TimestampType) =>
           TimestampType
   
         case (_: TimestampNTZType, _: DateType) | (_: DateType, _: TimestampNTZType) =>
           TimestampNTZType
       }
   ```





[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976097051


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   Can you elaborate on why we need LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS? I understand how it is used. Also, I am not a supporter of hardcoding date/time formats here.





[GitHub] [spark] xiaonanyang-db commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on PR #37933:
URL: https://github.com/apache/spark/pull/37933#issuecomment-1253319740

   > There are many cases to consider here: 1) the CSV data is pure date, pure timestamp, or a mixture. 2) the user specifies datetime pattern or not.
   > 
   > 1. pure date + no datetime pattern: infer as date type
   > 2. pure timestamp + no datetime pattern: infer as timestamp type
   > 3. mixture + no datetime pattern: infer as timestamp type
   > 4. pure date + datetime pattern: if pattern matches, infer as date type, otherwise string type
   > 5. pure timestamp + datetime pattern: if pattern matches, infer as timestamp type, otherwise string type
   > 6. mixture + datetime pattern: I think this is where the problem occurs. We will first parse the data as date, if can't, try parse as timestamp. This is very slow as we invoke the formatter twice. I think we shouldn't support mixture of date and timestamp in this case. If `prefersDate` is true, only try to infer as date type, otherwise only try to infer as timestamp.
   
   Thanks @cloud-fan
   Cases 1, 2, 4, and 5 are already supported; case 3 is also already supported, but this PR adjusts the implementation.
   For case 6, after this PR we will not always "first parse the data as date and, if that fails, try to parse it as timestamp":
   - When typeSoFar is `DateType` or some other tighter type, we will first parse the data as date and, if that fails, try to parse it as timestamp.
   - However, when typeSoFar is already `TimestampType` and we encounter another date value, we will directly parse it as timestamp, which fails, and the column is then finalized as `StringType`.
   
   In reality, we would not invoke the formatter twice in most cases.
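   
   Roughly, the ordering described above can be sketched as follows (illustrative only, not the actual CSVInferSchema code; the two parse helpers are placeholders and type names are simplified to strings):
   ```scala
   // Placeholders standing in for the real formatter calls.
   def parsesAsDate(s: String): Boolean = ???
   def parsesAsTimestamp(s: String): Boolean = ???
   
   // One inference step for case 6 (mixed column + explicit datetime patterns).
   def inferStep(typeSoFar: String, field: String): String = typeSoFar match {
     case "NullType" | "DateType" =>
       // Dates are still possible: try the date formatter first,
       // then fall back to the timestamp formatter.
       if (parsesAsDate(field)) "DateType"
       else if (parsesAsTimestamp(field)) "TimestampType"
       else "StringType"
     case "TimestampType" =>
       // Already at timestamp: only the timestamp formatter is tried, so a plain
       // date string fails here and the column is finalized as StringType.
       if (parsesAsTimestamp(field)) "TimestampType" else "StringType"
     case other => other
   }
   ```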




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r978223404


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -59,6 +59,13 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  // Used to determine inferring a column with mixture of dates and timestamps as TimestampType or
+  // StringType when no timestamp format is specified (the lenient timestamp formatter will be used)
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(
+    "yyyy-MM-dd", "yyyy-M-d", "yyyy-M-dd", "yyyy-MM-d", "yyyy-MM", "yyyy-M", "yyyy")

Review Comment:
   What about +/- and other strings within this date pattern? I don't think we should have such hardcoded date formats.





[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r978268428


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -2927,17 +2974,17 @@ abstract class CSVSuite
       }
 
       check(
-        "legacy",
+        "corrected",

Review Comment:
   to reduce the diff, can we still test `legacy` first?





[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974975695


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Why do we need this change? When inferring the schema, we always try to infer as date first and then as timestamp, right?





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After some quick discussion within my team, we agreed not to change these lines, to avoid unnecessary regressions and any other behavior changes. Thus, the behavior after this PR becomes:
   - If the user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamps according to that format. Thus, columns with a mixture of dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` is specified by the user, then for a column with a mixture of dates and timestamps:
     - If date values come before timestamp values:
       - If `prefersDate=true`, the column will be inferred as `StringType`
       - otherwise:
         - If the date format is supported by `Iso8601TimestampFormatter`, the column will be inferred as `TimestampType/TimestampNTZType`
         - otherwise, the column will be inferred as `StringType`
     - If timestamp values come before date values:
       - If the date format is supported by `Iso8601TimestampFormatter`, the column will be inferred as `TimestampType/TimestampNTZType`
       - otherwise, the column will be inferred as `StringType`
   
   There is no behavior change when `prefersDate=false`.
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 





[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r977086276


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -149,13 +149,10 @@ class CSVOptions(
   val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
   /**
-   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type)
-   * if schema inference is enabled. When being used with user-provided schema, tries to parse
-   * timestamp values as dates if the values do not conform to the timestamp formatter before
-   * falling back to the backward compatible parsing - the parsed values will be cast to timestamp
-   * afterwards.
+   * Infer columns with all valid date entries as date type (otherwise inferred as string type)

Review Comment:
   ```suggestion
      * Infer columns with all valid date entries as date type (otherwise inferred as string or timestamp type)
   ```





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r977079883


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing a mixture of dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as timestamp type
+        // otherwise infer it as StringType

Review Comment:
   We want consistent behavior when the timestamp format is not specified.
   
   When `prefersDate=false`, a column with mixed dates and timestamps could be inferred as timestamp if possible.
   Thus, we added the additional handling here so that the behavior is similar when `prefersDate=true`.
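   
   For example (hypothetical file and column, assuming the default yyyy-MM-dd date pattern, which the lenient parser accepts), the intent is that both of these reads infer the same timestamp type when no timestampFormat is given:
   ```scala
   // mixed.csv (hypothetical):
   //   ts
   //   2022-01-01
   //   2022-01-01 10:00:00
   
   // prefersDate=false: the mixed column can already be inferred as a timestamp.
   spark.read
     .option("header", "true")
     .option("inferSchema", "true")
     .option("prefersDate", "false")
     .csv("mixed.csv")
     .printSchema()
   
   // prefersDate=true: with the additional handling above, the result is intended
   // to be the same timestamp type rather than flipping to string.
   spark.read
     .option("header", "true")
     .option("inferSchema", "true")
     .option("prefersDate", "true")
     .csv("mixed.csv")
     .printSchema()
   ```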
   
   Does this make sense?





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r977079883


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing a mixture of dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as timestamp type
+        // otherwise infer it as StringType

Review Comment:
   We want consistent behavior when the timestamp format is not specified.
   
   When `prefersDate=false`, a column with mixed dates and timestamps could be inferred as timestamp if possible.
   Thus, we added the additional handling here for consistent behavior when `prefersDate=true`.
   
   Does this make sense?





[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r977077958


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing a mixture of dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as timestamp type
+        // otherwise infer it as StringType

Review Comment:
   I don't quite understand the rationale here. Why can't we directly return `Some(StringType)`?





[GitHub] [spark] sadikovi commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37933:
URL: https://github.com/apache/spark/pull/37933#issuecomment-1251908119

   Let me re-review the change to use ISO8601 parsing only.




[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974908544


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -224,7 +223,7 @@ class UnivocityParser(
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility if enabled.
-            if (!enableParsingFallbackForTimestampType) {
+            if (!enableParsingFallbackForDateType) {

Review Comment:
   @xiaonanyang-db Does it mean that we don't have test coverage for it? I am wondering how I managed to merge a PR like this without a unit test 🤔. I will open a follow-up PR to improve testing.





[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974998043


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Let me think about it as well, as there are different fallbacks with regard to date/timestamp parsing. IMHO, we would want some kind of flag to revert to the original behaviour, but this would be a lot of flags to configure.
   
   It is possible that some users are relying on the more lenient parsing of timestamps, so if we switch to ISO8601-only parsing, some jobs might need to be updated. Let me think a bit more on this.





[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974995602


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   I see, but the CAST code is supposed to be more efficient than the formatter. Do we have any perf numbers? Is there a slowdown in schema inference for CSV files that contain only timestamp values?





[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r977085759


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing a mixture of dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as timestamp type
+        // otherwise infer it as StringType

Review Comment:
   let's enrich the comment a bit more
   ```
   This only happens when the timestamp pattern is not specified, as the default
   timestamp parser is very lenient and can parse date string as well.
   ```





[GitHub] [spark] sadikovi commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37933:
URL: https://github.com/apache/spark/pull/37933#issuecomment-1253311617

   I think the logic in this PR seems reasonable. If prefersDate = true and we have date and timestamp strings, make the column StringType. If prefersDate was not set, then this could be inferred as timestamp if possible. 
   
   I am just not sure about hardcoding date formats in the CSVInferSchema parser.
   
   A bit off-topic, would specifying `.schema` when reading a CSV file or casting timestamp columns to dates solve this problem?
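   
   For reference, the explicit-schema route mentioned above would look roughly like this (the column name and file path are made up); it bypasses inference entirely, so the ambiguous column simply follows the declared type:
   ```scala
   import org.apache.spark.sql.types._
   
   // Hypothetical: declare the ambiguous column instead of inferring it.
   val schema = StructType(Seq(StructField("ts", TimestampType, nullable = true)))
   
   val df = spark.read
     .option("header", "true")
     .schema(schema)
     .csv("mixed.csv")
   df.printSchema()  // ts: timestamp, by declaration rather than inference
   ```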




[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r977144993


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing a mixture of dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as timestamp type
+        // otherwise infer it as StringType
+        val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern)
+        t1 match {
+          case DateType if canParseDateAsTimestamp(dateFormat, t2) =>
+            Some(t2)
+          case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) =>
+            Some(t1)
+          case _ => Some(StringType)
+        }
+      case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    }
+  }
+
+  /**
+   * Return true if strings of given date format can be parsed as timestamps
+   *  1. If user provides timestamp format, we will parse strings as timestamps using
+   *  Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed
+   *  as timestamp type in this case
+   *  2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which
+   *  is more lenient and can parse strings of some date formats as timestamps.
+   */
+  private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = {
+    if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) ||
+      (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) {
+      LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat)

Review Comment:
   This is a behavior change relative to the Spark 3.3 branch, where a column with mixed dates and timestamps could be inferred as timestamp type when no timestamp pattern is specified.





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r978269589


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -2927,17 +2974,17 @@ abstract class CSVSuite
       }
 
       check(
-        "legacy",
+        "corrected",

Review Comment:
   Done





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After some quick discussion within my team, we agreed not to change these lines, to avoid unnecessary regressions and any other behavior changes. Thus, the behavior after this PR becomes:
   - If the user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamps according to the format. Thus, columns mixing dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` is specified by the user, we keep the current behavior that columns mixing dates and timestamps could be inferred as `TimestampType/TimestampNTZType`, which is okay and even a good feature.
   
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan & @sadikovi.
   
   After some quick discussion within my team, we agreed not to change these lines, to avoid unnecessary regressions and any other behavior changes. Thus, the behavior after this PR becomes:
   - If the user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamps according to the format. Thus, columns mixing dates and timestamps will be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` is specified by the user, columns mixing dates and timestamps could be inferred as `TimestampType/TimestampNTZType`, which is okay and even a good feature.
   
   Does this make sense to you? @sadikovi @cloud-fan 





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976108061


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   When the timestamp format is not specified, the desired behavior is that a column with a mix of dates and timestamps could be inferred as timestamp type if the lenient timestamp formatter can also parse the date strings in that column.
   
   To achieve that without introducing other performance concerns, we simply check whether the date format is supported by the lenient timestamp formatter. Does that make sense?
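
   For readers following the thread, a rough sketch of the check being described (the set member below is illustrative only; the actual constant and its contents live in `CSVInferSchema.scala` and are not reproduced here):

   ```scala
   // Idea: instead of re-parsing every date value with the lenient timestamp
   // formatter during schema inference, only the configured date *pattern* is
   // checked against the patterns that formatter is known to handle.
   val lenientTsFormatterSupportedDateFormats: Set[String] =
     Set("yyyy-MM-dd") // illustrative member only; not the real contents

   def dateFormatParseableByLenientFormatter(dateFormat: String): Boolean =
     lenientTsFormatterSupportedDateFormats.contains(dateFormat)
   ```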





[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r977086043


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing a mixture of dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as timestamp type
+        // otherwise infer it as StringType
+        val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern)
+        t1 match {
+          case DateType if canParseDateAsTimestamp(dateFormat, t2) =>
+            Some(t2)
+          case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) =>
+            Some(t1)
+          case _ => Some(StringType)
+        }
+      case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    }
+  }
+
+  /**
+   * Return true if strings of given date format can be parsed as timestamps
+   *  1. If user provides timestamp format, we will parse strings as timestamps using
+   *  Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed
+   *  as timestamp type in this case
+   *  2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which
+   *  is more lenient and can parse strings of some date formats as timestamps.
+   */
+  private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = {
+    if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) ||
+      (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) {
+      LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat)

Review Comment:
   Do we really need to cover these corner cases? We can just say that we can only parse a date as timestamp if neither the timestamp pattern nor the date pattern is specified.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing a mixture of dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as timestamp type
+        // otherwise infer it as StringType
+        val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern)
+        t1 match {
+          case DateType if canParseDateAsTimestamp(dateFormat, t2) =>
+            Some(t2)
+          case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) =>
+            Some(t1)
+          case _ => Some(StringType)
+        }
+      case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    }
+  }
+
+  /**
+   * Return true if strings of given date format can be parsed as timestamps
+   *  1. If user provides timestamp format, we will parse strings as timestamps using
+   *  Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed
+   *  as timestamp type in this case
+   *  2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which
+   *  is more lenient and can parse strings of some date formats as timestamps.
+   */
+  private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = {
+    if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) ||
+      (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) {
+      LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat)

Review Comment:
   Do we really need to cover these corner cases? We can just say that we can only parse a date as timestamp if neither the timestamp pattern nor the date pattern is specified.
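
   For comparison, a hedged sketch of what this simpler rule could look like inside `CSVInferSchema` (reusing the option names visible in the diff above; `options` is the surrounding class's `CSVOptions`, and this is not the code that was merged):

   ```scala
   import org.apache.spark.sql.types.{DataType, TimestampNTZType, TimestampType}

   // Dates are promotable to a timestamp type only when the user supplied
   // neither a timestamp pattern nor a date pattern, i.e. all defaults apply.
   private def canParseDateAsTimestamp(tsType: DataType): Boolean =
     options.dateFormatInRead.isEmpty && (tsType match {
       case TimestampType    => options.timestampFormatInRead.isEmpty
       case TimestampNTZType => options.timestampNTZFormatInRead.isEmpty
       case _                => false
     })
   ```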





[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974948772


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -224,7 +223,7 @@ class UnivocityParser(
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility if enabled.
-            if (!enableParsingFallbackForTimestampType) {
+            if (!enableParsingFallbackForDateType) {

Review Comment:
   After thinking about it more, let me submit a separate PR to fix it. This is a serious bug and I will fix it ASAP; we may need to port the fix to other Spark branches.





[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974973788


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type
+      Some(parameters.getOrElse("timestampFormat", TimestampFormatter.defaultPattern()))

Review Comment:
   `timestampFormatInRead` doesn't need to be `Option[String]` now.
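
   For context, a sketch of the simplification being suggested, assuming the default is now always filled in (`parameters` and `TimestampFormatter` come from the surrounding `CSVOptions` as in the diff; this is hypothetical, not the final code):

   ```scala
   // Before: no default in the non-legacy branch, hence Option[String]:
   //   val timestampFormatInRead: Option[String] = parameters.get("timestampFormat")

   // After this PR a default is always supplied, so the Option wrapper carries
   // no information and the field could become a plain String:
   val timestampFormatInRead: String =
     parameters.getOrElse("timestampFormat", TimestampFormatter.defaultPattern())
   ```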





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974986988


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   - There are two timestamp formatters: the default one and Iso8601.
     - The default formatter uses the same parsing logic as CAST, which can parse a date value as a timestamp.
     - The Iso8601 formatter parses more strictly.
     - If no timestampFormat is given, we will use the default one.
     
   - In the case of a column containing timestamp values first and dates following, if we use the default formatter, it would still parse the dates as timestamp type, which is inconsistent with what we are proposing now (see the sketch below).
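
   A small illustration of that difference using public APIs, assuming a `SparkSession` named `spark` (e.g. in spark-shell); CAST follows the same lenient rules as the default formatter, while `to_timestamp` with an explicit pattern is strict:

   ```scala
   // Lenient path: a date-only string is accepted and padded to midnight.
   spark.sql("SELECT CAST('2022-01-01' AS TIMESTAMP)").show()

   // Strict path: with an explicit pattern the date-only string does not match,
   // so parsing fails (NULL under the default, non-ANSI setting).
   spark.sql("SELECT to_timestamp('2022-01-01', 'yyyy-MM-dd HH:mm:ss')").show()
   ```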





[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Posted by GitBox <gi...@apache.org>.
xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After some quick discussion within my team, we agreed not to change these lines, to avoid unnecessary regressions and any other behavior changes. Thus, the behavior after this PR becomes:
   - If the user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamps according to the format. Thus, columns mixing dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` is specified by the user, then for a column mixing dates and timestamps:
     - If date values are before timestamp values, the column will be inferred as `StringType`
     - If timestamp values are before date values
       - If the date format is supported by `Iso8601TimestampFormatter`, the column will be inferred as `TimestampType/TimestampNTZType`
       - otherwise the column will be inferred as `StringType`
   
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 


