Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/06/15 00:49:56 UTC

[GitHub] [spark] Jonathancui123 opened a new pull request, #36871: infer date type in csv

Jonathancui123 opened a new pull request, #36871:
URL: https://github.com/apache/spark/pull/36871

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r906233209


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -23,6 +23,12 @@
     ],
     "sqlState" : "22005"
   },
+  "CANNOT_INFER_DATE" : {
+    "message" : [
+      "Cannot infer date in schema inference when LegacyTimeParserPolicy is \"LEGACY\". Legacy Date formatter does not support strict date format matching which is required to avoid inferring timestamps and other non-date entries to date."
+    ],
+    "sqlState" : "22005"

Review Comment:
   Is this the right sqlState code? Tests would not pass until I added a sqlState code





[GitHub] [spark] Jonathancui123 commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1163777137

   I added a new `QueryExecutionError` called `inferDateWithLegacyTimeParserError` that is thrown when `inferDate=true` and SQL Configuration `LegacyTimeParserPolicy` is `LEGACY`. This prevents the legacy parser from being used with schema inference for date and fixes the failing test 
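
   The guard described above can be sketched in Python as follows (an illustrative sketch only; the PR's actual implementation is Scala code in `CSVOptions.scala` that throws `QueryExecutionErrors.inferDateWithLegacyTimeParserError`, and the names below are hypothetical stand-ins):

   ```python
   class InferDateWithLegacyTimeParserError(ValueError):
       """Stand-in for the new QueryExecutionError raised by the guard."""

   def resolve_infer_date(options, legacy_time_parser_policy):
       # Read the inferDate option (default false) and reject the combination
       # with the legacy time parser, since the legacy formatter cannot do the
       # strict matching that date inference requires.
       infer_date = str(options.get("inferDate", "false")).lower() == "true"
       if infer_date and legacy_time_parser_policy == "LEGACY":
           raise InferDateWithLegacyTimeParserError(
               "inferDate is not supported when legacyTimeParserPolicy is LEGACY")
       return infer_date
   ```

   With any non-LEGACY policy the option simply resolves to its boolean value; only the unsupported combination fails fast at option-parsing time.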




[GitHub] [spark] gengliangwang commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r904617613


##########
docs/sql-data-sources-csv.md:
##########
@@ -108,6 +108,12 @@ Data source options of CSV can be set via:
     <td>Infers the input schema automatically from data. It requires one extra pass over the data. CSV built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
+  <tr>
+    <td><code>inferDate</code></td> 
+    <td>false</td>
+    <td>Whether or not to infer columns that satisfy the <code>dateFormat</code> option as <code>Date</code>. Requires <code>inferSchema</code> to be <code>true</code>. Legacy date formats in Timestamp columns cannot be parsed with this option.</td>

Review Comment:
   Let's mention what will happen if the option is false and all values are dates





[GitHub] [spark] gengliangwang commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r904622578


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -117,8 +123,19 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
         case LongType => tryParseLong(field)
         case _: DecimalType => tryParseDecimal(field)
         case DoubleType => tryParseDouble(field)
-        case TimestampNTZType => tryParseTimestampNTZ(field)
-        case TimestampType => tryParseTimestamp(field)
+        case DateType => tryParseDateTime(field)

Review Comment:
   It seems changing the method `tryParseDouble` should be enough






[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r909084899


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +198,46 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+              // There may be date type entries in timestamp column due to schema inference
+              if (options.inferDate) {
+                daysToMicros(dateFormatter.parse(datum), options.zoneId)
+              } else {
+                throw(e)
+              }
+            }
+        }
+      }
+
+    case _: TimestampNTZType => (d: String) =>
+      nullSafeDatum(d, name, nullable, options) { datum =>
+        try {
+          timestampNTZFormatter.parseWithoutTimeZone(datum, false)
+        } catch {
+          case NonFatal(e) if (options.inferDate) =>
+            daysToMicros(dateFormatter.parse(datum), options.zoneId)

Review Comment:
   Wow great catch! I hadn't fully considered the effect of user timeZone on TimestampNTZ parsing. 
   
   I've fixed the error and I've modified this test to have a PST user and check that the parsed date is converted to a timestamp in UTC. 
   
   https://github.com/Jonathancui123/spark/blob/8df7ebbd6c1c0d5bb875ce554026f9b9aeae148e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala#L368-L374 
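
   The time-zone sensitivity mentioned above can be sketched in Python (a minimal sketch of what `daysToMicros` with `options.zoneId` does; Spark's actual implementation is `DateTimeUtils.daysToMicros` in Scala):

   ```python
   from datetime import datetime, timedelta
   from zoneinfo import ZoneInfo

   def days_to_micros(days, tz):
       # Midnight of the given epoch day in the session time zone, expressed
       # as microseconds since the Unix epoch. A parsed date therefore maps to
       # different instants depending on the zone, which is why a PST session
       # must not silently feed zone-dependent micros into a TimestampNTZ column.
       d = datetime(1970, 1, 1) + timedelta(days=days)
       local_midnight = datetime(d.year, d.month, d.day, tzinfo=tz)
       return int(local_midnight.timestamp() * 1_000_000)
   ```

   For epoch day 0, a UTC session yields 0 micros while an America/Los_Angeles session yields eight hours' worth, illustrating the discrepancy the test above checks for.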





[GitHub] [spark] Jonathancui123 commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1192923903

   > Because "1765-03-28" does not match timestamp pattern and the column is inferred as TimestampType, it should be returned as null. However, in the test it is returned as 1765-03-28 00:00:00.0. This is at very least confusing - inferDate should only affect dates, not timestamp columns.
   
   @sadikovi I've verified on an older version of Spark, prior to this PR, that "1765-03-28" in a timestamp column without a user-specified format is parsed as "1765-03-28 00:00:00.0". So the behavior of parsing default-format dates in timestamp columns is not due to this PR.
   
   Prior to changes, in a timestamp column:
   - custom format date: null
   - default format date: Parsed by fallback
   
   After inferDate PR, in a timestamp column:
   - custom format date: Parsed if inferDate is true, otherwise null
   - default format date: Parsed by fallback
   
   After enableParsingFallbackForDateType PR (#37147), in a timestamp column: 
   - custom format date: null
   - default format date: null
   
   Since default format date was previously parsed by fallback, we thought it was desirable for dates to be parsed in a timestamp column. So we included support for custom format dates in a timestamp column when `inferDate`=`true`.
   
   We have two options for target behavior:
   
   OPTION A - Target behavior, in a timestamp column: 
   - custom format date: Parsed if inferDate is true, otherwise null
   - default format date: Parsed if inferDate is true, otherwise null
   --> move the fallback error in #37147 to allow date parsing if `inferDate`=`true`
   
   OPTION B - Target behavior, in a timestamp column: 
   - custom format date: null
   - default format date: null
   --> remove redundant dateFormatter.parse behavior for `inferDate` = `true` 
   
   @cloud-fan  @HyukjinKwon should we go with Option A or Option B?
   
   




[GitHub] [spark] bersprockets commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
bersprockets commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r906409769


##########
docs/sql-data-sources-csv.md:
##########
@@ -108,6 +108,12 @@ Data source options of CSV can be set via:
     <td>Infers the input schema automatically from data. It requires one extra pass over the data. CSV built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
+  <tr>
+    <td><code>inferDate</code></td> 
+    <td>false</td>
+    <td>Whether or not to infer columns that satisfy the <code>dateFormat</code> option as <code>Date</code>. Requires <code>inferSchema</code> to be <code>true</code>. When <code>false</code>, columns with dates will be inferred as <code>String</code> (or as <code>Timestamp</code> if it fits the <code>timestampFormat</code>) Legacy date formats in Timestamp columns cannot be parsed with this option.</td>

Review Comment:
   >fits the <code>timestampFormat</code>) Legacy date
   
   Probably needs a period in there after the close parenthesis.
   





[GitHub] [spark] Jonathancui123 commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1162506485

   I've added an `inferDate` flag that addresses the comments above:
   1. Performance: Logic for date inference during schema inference is guarded behind the flag.
   2. Parsing behavior with inferDate=false: Behavior is identical to the behavior prior to these changes.
   3. Parsing behavior with inferDate=true: In timestamp columns, we will attempt to parse as timestamp first, and then parse as date if the timestamp parse fails.
   > This is necessary because we can have columns of Date and Timestamp entries inferred as Timestamp. Consider a column with a `DateType` entry followed by a `TimestampType` entry. We would expect this column to be inferred as a `TimestampType` column. Thus, when parsing the column, the timestamp converter will fail on the Date entry, so we need to try converting it with the Date converter. If both converters fail, we throw an error.
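
   The timestamp-then-date fallback can be sketched in Python (an illustrative sketch only; the real converter is the `UnivocityParser` Scala code in this PR, and the formats below are arbitrary examples):

   ```python
   from datetime import datetime

   def make_timestamp_converter(timestamp_fmt, date_fmt, infer_date):
       # Build a converter for a column inferred as timestamp: try the
       # timestamp format first; if that fails and inferDate is enabled,
       # fall back to the date format, which yields midnight of that date.
       def convert(datum):
           try:
               return datetime.strptime(datum, timestamp_fmt)
           except ValueError:
               if infer_date:
                   return datetime.strptime(datum, date_fmt)
               raise
       return convert
   ```

   With `infer_date=False` a date-only entry still raises, matching the pre-change behavior described in point 2.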




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r904450171


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -195,6 +195,19 @@ class CSVOptions(
    */
   val enforceSchema = getBool("enforceSchema", default = true)
 
+  /**
+   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible with
+   * legacyTimeParserPolicy == LEGACY
+   */
+  val inferDate = {
+    val inferDateFlag = getBool("inferDate")
+    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) {
+      throw QueryExecutionErrors.inferDateWithLegacyTimeParserError()

Review Comment:
   Yeah, this is nice.





[GitHub] [spark] bersprockets commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
bersprockets commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r899496784


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala:
##########
@@ -358,4 +358,19 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper {
       Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, "UTC")
     check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
   }
+
+  test("dates should be parsed correctly in a timestamp column") {
+    def checkDate(dataType: DataType): Unit = {
+      val timestampsOptions =
+        new CSVOptions(Map("timestampFormat" -> "dd/MM/yyyy HH:mm",
+          "timestampNTZFormat" -> "dd-MM-yyyy HH:mm", "dateFormat" -> "dd_MM_yyyy"),

Review Comment:
   >to make sure timestamps are not parsed as date types without conflicting.
   
   That's actually what happens:
   
   Before this PR:
   ```
   scala> val csvInput = Seq("0,2012-01-01 12:00:00", "1,2021-07-01 15:00:00").toDS()
   csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
   
   scala> val df = spark.read.option("inferSchema", "true").csv(csvInput)
   df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: timestamp]
   
   scala> df.printSchema
   root
    |-- _c0: integer (nullable = true)
    |-- _c1: timestamp (nullable = true)
   
   scala> 
   ```
   After this PR:
   ```
   scala> val csvInput = Seq("0,2012-01-01 12:00:00", "1,2021-07-01 15:00:00").toDS()
   csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
   
   scala> val df = spark.read.option("inferSchema", "true").csv(csvInput)
   df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: date]
   
   scala> df.printSchema
   root
    |-- _c0: integer (nullable = true)
    |-- _c1: date (nullable = true)
   
   scala>
   ```
   It looks like some tests fail too, like `CSVInferSchemaSuite` and `CSVv1Suite`, and possibly others (I ran these two suites on my laptop; for some reason, the GitHub Actions didn't run tests for this PR. Maybe @Jonathancui123 needs to turn them on in his fork?).
   
   >We should probably 1. add either SQL configuration or an option e.g., infersDate
   
   I think you would need something like that: when set, the date formatter could use the slower, more strict method of parsing (so "2012-01-01 12:00:00" wouldn't parse as a date).
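
   The difference between strict and lenient (legacy-style) date matching can be sketched in Python, assuming a hypothetical `yyyy-MM-dd` date format:

   ```python
   from datetime import datetime, date

   DATE_FMT = "%Y-%m-%d"

   def lenient_parse_date(s):
       # Legacy-style behavior: only the leading characters need to match, so
       # a timestamp string like "2012-01-01 12:00:00" is wrongly accepted as
       # a date during inference.
       return datetime.strptime(s[:10], DATE_FMT).date()

   def strict_parse_date(s):
       # Strict matching, as needed for date inference: the entire string
       # must match the date format, so a timestamp string is rejected.
       return datetime.strptime(s, DATE_FMT).date()
   ```

   Under strict matching, "2012-01-01 12:00:00" raises instead of parsing as 2012-01-01, which is exactly why the column in the example above would stay a timestamp.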





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r924930337


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters
+   */
+  val inferDate = {
+    val inferDateFlag = getBool("inferDate")
+    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) {

Review Comment:
   All tests have passed after the rollback - I think this is ready to merge






[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r898694871


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala:
##########
@@ -358,4 +358,19 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper {
       Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, "UTC")
     check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
   }
+
+  test("dates should be parsed correctly in a timestamp column") {

Review Comment:
   Got it! I was under the impression that was only for bug fixes. 





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r903187641


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +199,50 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+              // There may be date type entries in timestamp column due to schema inference
+              if (options.inferDate) {
+                daysToMicros(dateFormatter.parse(datum), options.zoneId)

Review Comment:
   We do not use the legacy DateFormatter here to avoid parsing timestamps with invalid suffixes. We want to throw an error when invalid timestamps are given. 
   
   e.g. The legacy DateFormatter will parse the following string without throwing an error:
   dateFormat: `yyyy-mm-dd`
   string: `2001-09-08-randomtext`
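   The lenient-trailing-text behavior described above can be reproduced outside Spark with plain JDK APIs (the legacy parser policy is built on `java.text.SimpleDateFormat`, while the strict `Iso8601DateFormatter` is built on `java.time`). This is an illustrative comparison, not Spark code:

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class LenientParseDemo {
    public static void main(String[] args) {
        // Legacy-style SimpleDateFormat stops at the first unparseable
        // character, so the trailing "-randomtext" is silently ignored.
        SimpleDateFormat legacy = new SimpleDateFormat("yyyy-MM-dd");
        System.out.println(legacy.parse("2001-09-08-randomtext", new ParsePosition(0)));

        // java.time requires the whole input to match and throws instead.
        DateTimeFormatter strict = DateTimeFormatter.ofPattern("yyyy-MM-dd");
        try {
            LocalDate.parse("2001-09-08-randomtext", strict);
        } catch (DateTimeParseException e) {
            System.out.println("rejected by strict parser");
        }
    }
}
```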





[GitHub] [spark] HyukjinKwon commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1162608355

   One test failed at `org.apache.spark.sql.execution.datasources.csv.CSVLegacyTimeParserSuite`
   
   ```
   2022-06-22T01:10:13.0585659Z [info] - SPARK-39469: Infer schema for date type *** FAILED *** (106 milliseconds)
   2022-06-22T01:10:13.0592465Z [info]   StructType(StructField(date,DateType,true), StructField(timestamp-date,DateType,true), StructField(date-timestamp,DateType,true)) did not equal StructType(StructField(date,DateType,true), StructField(timestamp-date,TimestampType,true), StructField(date-timestamp,TimestampType,true)) (CSVSuite.scala:2773)
   2022-06-22T01:10:13.0598357Z [info]   Analysis:
   2022-06-22T01:10:13.0604058Z [info]   StructType(1: StructField(timestamp-date,DateType,true) -> StructField(timestamp-date,TimestampType,true), 2: StructField(date-timestamp,DateType,true) -> StructField(date-timestamp,TimestampType,true))
   2022-06-22T01:10:13.0610333Z [info]   org.scalatest.exceptions.TestFailedException:
   2022-06-22T01:10:13.0622299Z [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
   2022-06-22T01:10:13.0627015Z [info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
   2022-06-22T01:10:13.0634808Z [info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
   2022-06-22T01:10:13.0639642Z [info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
   2022-06-22T01:10:13.0647369Z [info]   at org.apache.spark.sql.execution.datasources.csv.CSVSuite.$anonfun$new$379(CSVSuite.scala:2773)
   2022-06-22T01:10:13.0655897Z [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
   2022-06-22T01:10:13.0659677Z [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
   2022-06-22T01:10:13.0663352Z [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   2022-06-22T01:10:13.0666989Z [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
   2022-06-22T01:10:13.0670619Z [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
   2022-06-22T01:10:13.0674478Z [info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
   2022-06-22T01:10:13.0678412Z [info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203)
   2022-06-22T01:10:13.0694953Z [info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
   2022-06-22T01:10:13.0699731Z [info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
   2022-06-22T01:10:13.0703990Z [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   2022-06-22T01:10:13.0708256Z [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
   2022-06-22T01:10:13.0712723Z [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
   2022-06-22T01:10:13.0717517Z [info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
   2022-06-22T01:10:13.0721680Z [info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
   2022-06-22T01:10:13.0726397Z [info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
   ```




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r904449724


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -117,8 +123,19 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
         case LongType => tryParseLong(field)
         case _: DecimalType => tryParseDecimal(field)
         case DoubleType => tryParseDouble(field)
-        case TimestampNTZType => tryParseTimestampNTZ(field)
-        case TimestampType => tryParseTimestamp(field)
+        case DateType => tryParseDateTime(field)
+        case TimestampNTZType =>

Review Comment:
   Maybe we can fix it as below:
   
   ```suggestion
           case TimestampNTZType if options.inferDate => tryParseDateTime(field)
           case TimestampNTZType => tryParseTimestampNTZ(field)
   ```





[GitHub] [spark] bersprockets commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
bersprockets commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r906320780


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +199,50 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+              // There may be date type entries in timestamp column due to schema inference
+              if (options.inferDate) {
+                daysToMicros(dateFormatter.parse(datum), options.zoneId)

Review Comment:
   >Do you know what is the advantage of allowing Legacy Formatter?
   
   One benefit of the legacy formatter is that it recognizes some pre-Gregorian leap years (like `1500-02-29`) that exist only in the hybrid Julian calendar. Note how schema inference chooses `string` until you set the legacy parser.
   ```
   scala> val csvInput = Seq("1425-03-22T00:00:00", "2022-01-01T00:00:00", "1500-02-29T00:00:00").toDS()
   csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
   
   scala> spark.read.options(Map("inferSchema" -> "true", "timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss")).csv(csvInput).printSchema
   root
    |-- _c0: string (nullable = true)
   
   scala> sql("set spark.sql.legacy.timeParserPolicy=legacy")
   res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
   
   scala> spark.read.options(Map("inferSchema" -> "true", "timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss")).csv(csvInput).printSchema
   root
    |-- _c0: timestamp (nullable = true)
   
   scala> 
   ```
   That, of course, matters only if the application's input comes from legacy systems that still use hybrid Julian, and the input contains pre-Gregorian dates (e.g., for date encoding, which is the only real-world use case I have come across). I would imagine that audience is small and probably getting smaller.
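   The calendar split behind this can be shown with JDK classes alone (this is an illustrative sketch, not Spark code): `SimpleDateFormat` uses the hybrid Julian/Gregorian calendar with the 1582 cutover, in which 1500 is a leap year, while `java.time` uses the proleptic Gregorian calendar, in which it is not.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

public class JulianLeapDemo {
    public static void main(String[] args) {
        // Hybrid calendar: 1500-02-29 is a valid Julian date, so it parses.
        System.out.println(
            new SimpleDateFormat("yyyy-MM-dd").parse("1500-02-29", new ParsePosition(0)));

        // Proleptic Gregorian: 1500 is divisible by 100 but not 400,
        // so it is not a leap year and the same string is rejected.
        try {
            LocalDate.parse("1500-02-29");
        } catch (DateTimeParseException e) {
            System.out.println("rejected by java.time");
        }
    }
}
```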





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r904402104


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -110,15 +116,43 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
   def inferField(typeSoFar: DataType, field: String): DataType = {
     if (field == null || field.isEmpty || field == options.nullValue) {
       typeSoFar
+    } else
+    if (options.inferDate) {
+      val typeElemInfer = typeSoFar match {
+        case NullType => tryParseInteger(field)
+        case IntegerType => tryParseInteger(field)
+        case LongType => tryParseLong(field)
+        case _: DecimalType => tryParseDecimal(field)
+        case DoubleType => tryParseDouble(field)
+        case DateType => tryParseDateTime(field)
+        case TimestampNTZType => tryParseDateTime(field)
+        case TimestampType => tryParseDateTime(field)
+        case BooleanType => tryParseBoolean(field)
+        case StringType => StringType
+        case other: DataType =>
+          throw QueryExecutionErrors.dataTypeUnexpectedError(other)
+      }
+      compatibleType(typeSoFar, typeElemInfer).getOrElse(StringType)
     } else {
       val typeElemInfer = typeSoFar match {
         case NullType => tryParseInteger(field)
         case IntegerType => tryParseInteger(field)
         case LongType => tryParseLong(field)
         case _: DecimalType => tryParseDecimal(field)
         case DoubleType => tryParseDouble(field)
-        case TimestampNTZType => tryParseTimestampNTZ(field)
-        case TimestampType => tryParseTimestamp(field)
+        case DateType => tryParseDateTime(field)
+        case TimestampNTZType =>
+          if (options.inferDate) {

Review Comment:
   My bad, I meant to remove the first match expression but missed it during code cleanup





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r905350901


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -117,8 +123,19 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
         case LongType => tryParseLong(field)
         case _: DecimalType => tryParseDecimal(field)
         case DoubleType => tryParseDouble(field)
-        case TimestampNTZType => tryParseTimestampNTZ(field)
-        case TimestampType => tryParseTimestamp(field)
+        case DateType => tryParseDateTime(field)

Review Comment:
   I think this change is necessary:
   Consider a column with a TimestampType entry followed by a DateType entry. We would expect this column to be inferred as a TimestampType column.
   `typeSoFar` will be `Timestamp` when `inferField` is called on the second entry, which is DateType, so `inferField` needs logic to try parsing the date type even when `typeSoFar` is `Timestamp`.
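   The widening behavior described here can be sketched as a fold over column entries. This is a hypothetical miniature of `CSVInferSchema.inferField`/`compatibleType` (the enum, method names, and parsing shortcuts are illustrative, not Spark's):

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.List;

public class InferSketch {
    enum T { NULL, DATE, TIMESTAMP, STRING }

    // Keep the narrowest type that still covers every entry seen so far.
    static T inferField(T soFar, String field) {
        T entry = tryParse(field);
        if (soFar == T.NULL) return entry;
        if (soFar == entry) return soFar;
        // A date fits into a timestamp column (as midnight), so widen.
        if ((soFar == T.DATE && entry == T.TIMESTAMP)
                || (soFar == T.TIMESTAMP && entry == T.DATE)) {
            return T.TIMESTAMP;
        }
        return T.STRING; // incompatible types fall back to string
    }

    static T tryParse(String s) {
        try { LocalDate.parse(s); return T.DATE; } catch (Exception ignored) { }
        try { LocalDateTime.parse(s); return T.TIMESTAMP; } catch (Exception ignored) { }
        return T.STRING;
    }

    public static void main(String[] args) {
        T t = T.NULL;
        // A timestamp entry followed by a date entry...
        for (String f : List.of("2021-01-01T00:00:00", "2021-01-02")) {
            t = inferField(t, f);
        }
        System.out.println(t); // ...still infers TIMESTAMP for the column
    }
}
```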
   
   
   





[GitHub] [spark] Jonathancui123 commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1175592703

   > would appreciate it if you summarize any unaddressed comments or concerns. I am fine given that we disable this here by default.
   
   @HyukjinKwon I don't see any remaining unaddressed comments - I believe we are ready to merge this PR. Here are the resolutions to potential concerns:
   
   - No backwards compatibility concern: `Legacy` parser policy cannot be used with `inferDate`. We can open another ticket to add a `LegacyStrictSimpleDateFormatter` [as Bruce suggested](https://github.com/apache/spark/pull/36871#discussion_r905348742) if necessary. 
   - Performance degradation: Benchmark results ([in the bottom of PR description](https://github.com/apache/spark/pull/36871#issue-1271523396)) show that date inference greatly slows down schema inference for CSVs with dates. This feature is guarded by the CSV Option `inferDate` and is off by default. 
   - Logical errors for strict parsing and converting Date to TimestampNTZ have been fixed and have unit tests
   
   




[GitHub] [spark] Yaohua628 commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Yaohua628 commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1190921287

   Hi folks, @HyukjinKwon @bersprockets @cloud-fan, thanks for reviewing and some great suggestions, is this PR good to go? Thanks!




[GitHub] [spark] Jonathancui123 commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1193518865

   > @Jonathancui123 I fixed this issue in [10ca4a4](https://github.com/apache/spark/commit/10ca4a4a09623ef9ea217b0fdf33bbdf2e22c227) on my PR. Can you review? Thanks.
   
   Thanks for the change! I agree with it and I've [left a comment](https://github.com/apache/spark/pull/37147#discussion_r928384084)




[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [WIP] infer date type in csv

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r898222912


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -117,8 +124,8 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
         case LongType => tryParseLong(field)
         case _: DecimalType => tryParseDecimal(field)
         case DoubleType => tryParseDouble(field)
-        case TimestampNTZType => tryParseTimestampNTZ(field)
-        case TimestampType => tryParseTimestamp(field)
+        // Temporary NOTE: DateTimeType is private to [sql] package
+        case DateType | TimestampNTZType | TimestampType => tryParseDate(field)

Review Comment:
   Rename to `tryParseDate`



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -117,8 +124,8 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
         case LongType => tryParseLong(field)
         case _: DecimalType => tryParseDecimal(field)
         case DoubleType => tryParseDouble(field)
-        case TimestampNTZType => tryParseTimestampNTZ(field)
-        case TimestampType => tryParseTimestamp(field)
+        // Temporary NOTE: DateTimeType is private to [sql] package
+        case DateType | TimestampNTZType | TimestampType => tryParseDate(field)

Review Comment:
   Rename to `tryParseDateTime`





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [WIP][SPARK-39469] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r898222912


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -117,8 +124,8 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
         case LongType => tryParseLong(field)
         case _: DecimalType => tryParseDecimal(field)
         case DoubleType => tryParseDouble(field)
-        case TimestampNTZType => tryParseTimestampNTZ(field)
-        case TimestampType => tryParseTimestamp(field)
+        // Temporary NOTE: DateTimeType is private to [sql] package
+        case DateType | TimestampNTZType | TimestampType => tryParseDate(field)

Review Comment:
   Rename to `tryParseDateTime`





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r918440378


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters
+   */
+  val inferDate = {
+    val inferDateFlag = getBool("inferDate")
+    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) {

Review Comment:
   Yeah, I don't think legacy mode supports this. Let's just throw an exception for now.





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r919291362


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters
+   */
+  val inferDate = {
+    val inferDateFlag = getBool("inferDate")
+    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) {

Review Comment:
   In the [most recent commit](https://github.com/apache/spark/pull/36871/commits/e1170d0ee2027d810d2b23243602de147748838b), I implemented the suggestion from @cloud-fan to always use the non-legacy parser for inference and allowing `inferDate=true` with `legacyTimeParserPolicy = LEGACY`. What do you think?





[GitHub] [spark] sadikovi commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r928322559


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +198,46 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {

Review Comment:
   Agreed. We should address the order. Otherwise, it is unclear how to handle fallback.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Jonathancui123 commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1171559182

   https://github.com/apache/spark/blob/8df7ebbd6c1c0d5bb875ce554026f9b9aeae148e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala#L165-L172
   
   Added one last commit^^ to fix the case where `dateFormat` is not specified. When `inferDate` is true, we can set a default value for `dateFormat` to ensure that the correct dateFormatter is used:
   
   https://github.com/apache/spark/blob/8c02823b49a6f28005236e4965a25e664d73ebea/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala#L182-L184
   
   The `defaultDateFormatter` class doesn't implement strict parsing, but `Iso8601DateFormatter` does.




[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r918153782


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters
+   */
+  val inferDate = {
+    val inferDateFlag = getBool("inferDate")
+    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) {

Review Comment:
   Does the `Iso8601DateFormatter` properly parse legacy date types? I'm inclined to believe it doesn't since the `getFormatter` returns a special `LegacyDateFormatter` when the `legacyTimeParserPolicy == LEGACY`
   
   https://github.com/apache/spark/blob/9c5c21ccc9c7f41a204a738cd540c14793ffc8cb/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala#L174-L188
   
   cc @bersprockets @HyukjinKwon 





[GitHub] [spark] cloud-fan commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r918484209


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters
+   */
+  val inferDate = {
+    val inferDateFlag = getBool("inferDate")
+    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) {

Review Comment:
   My point is, we can use a dedicated datetime parser when inferring date types, which does not need to respect `legacyTimeParserPolicy`, so there is no need of failing for unsupported parser policy.





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r905350901


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -117,8 +123,19 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
         case LongType => tryParseLong(field)
         case _: DecimalType => tryParseDecimal(field)
         case DoubleType => tryParseDouble(field)
-        case TimestampNTZType => tryParseTimestampNTZ(field)
-        case TimestampType => tryParseTimestamp(field)
+        case DateType => tryParseDateTime(field)

Review Comment:
   I think this change is necessary:
   Consider a column with a TimestampType entry followed by a DateType entry. We would expect this column to be inferred as a TimestampType column.
   `typeSoFar` will be `Timestamp` when `inferField` is called on the second entry, which is `DateType`. We need logic in `inferField` to try to parse `DateType` even when `typeSoFar` is `Timestamp`.
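   The widening rule described above can be sketched in Python. This is a hypothetical, simplified model of `CSVInferSchema.inferField`, not Spark code; the format strings, type names, and helper functions are illustrative stand-ins:

   ```python
   from datetime import datetime

   # Stand-ins for the CSV options dateFormat / timestampFormat (hypothetical).
   DATE_FMT = "%d_%m_%Y"
   TS_FMT = "%d/%m/%Y %H:%M"

   # Widening order: a column's inferred type can only move rightward.
   ORDER = {"NullType": 0, "DateType": 1, "TimestampType": 2, "StringType": 3}

   def parses(value: str, fmt: str) -> bool:
       try:
           datetime.strptime(value, fmt)
           return True
       except ValueError:
           return False

   def field_type(field: str) -> str:
       # Try the narrower date pattern first, then the timestamp pattern.
       if parses(field, DATE_FMT):
           return "DateType"
       if parses(field, TS_FMT):
           return "TimestampType"
       return "StringType"

   def infer_field(type_so_far: str, field: str) -> str:
       # A DateType entry arriving after TimestampType entries keeps the
       # column as TimestampType instead of demoting it to StringType.
       ft = field_type(field)
       return type_so_far if ORDER[type_so_far] >= ORDER[ft] else ft

   t = "NullType"
   for v in ["08/09/2001 10:30", "09_10_2001"]:  # timestamp entry, then date entry
       t = infer_field(t, v)
   print(t)  # TimestampType
   ```

   In this model, `inferDate=true` corresponds to enabling the `DateType` rung of the ladder; without it, inference would skip straight from numeric types to timestamps.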
   
   
   





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r924030481


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters
+   */
+  val inferDate = {
+    val inferDateFlag = getBool("inferDate")
+    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) {

Review Comment:
   I am fine with rolling back 





[GitHub] [spark] brkyvz commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
brkyvz commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r906394689


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -169,6 +178,16 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
   private def tryParseDouble(field: String): DataType = {
     if ((allCatch opt field.toDouble).isDefined || isInfOrNan(field)) {
       DoubleType
+    } else if (options.inferDate) {
+      tryParseDateTime(field)
+    } else {
+      tryParseTimestampNTZ(field)
+    }
+  }
+
+  private def tryParseDateTime(field: String): DataType = {
+    if ((allCatch opt dateFormatter.parse(field)).isDefined) {

Review Comment:
   This is a very expensive way





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r909085510


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala:
##########
@@ -358,4 +359,26 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper {
       Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, "UTC")
     check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
   }
+
+  test("SPARK-39469: dates should be parsed correctly in a timestamp column when inferDate=true") {
+    def checkDate(dataType: DataType): Unit = {
+      val timestampsOptions =
+        new CSVOptions(Map("inferDate" -> "true", "timestampFormat" -> "dd/MM/yyyy HH:mm",
+          "timestampNTZFormat" -> "dd-MM-yyyy HH:mm", "dateFormat" -> "dd_MM_yyyy"),
+          false, DateTimeUtils.getZoneId("-08:00").toString)
+      // Use CSVOption ZoneId="-08:00" (PST) to test that Dates in TimestampNTZ column are always
+      // converted to their equivalent UTC timestamp
+      val dateString = "08_09_2001"
+      val expected = dataType match {
+        case TimestampType => date(2001, 9, 8, 0, 0, 0, 0, ZoneOffset.of("-08:00"))
+        case TimestampNTZType => date(2001, 9, 8, 0, 0, 0, 0, ZoneOffset.UTC)

Review Comment:
   > I think zoneId should probably be UTC for timestamp_ntz. Otherwise, you end up with oddities like this...
   
   @bersprockets I've modified this test to have the user in PST and check that the parsed date is converted to a timestamp in UTC. This checks for the error you caught in your [previous comment](https://github.com/apache/spark/pull/36871#discussion_r906513077). Thanks!
   
   I think the PR should be ready to merge. Please let me know if there's anything else we need to fix. I'll keep an eye on the Github Actions results.
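   The zone behavior this test pins down can be illustrated with a small Python sketch. `days_to_micros` and the format string below are hypothetical stand-ins for Spark's `daysToMicros(days, zoneId)`, not its actual API:

   ```python
   from datetime import datetime, timezone, timedelta

   def days_to_micros(date_str: str, fmt: str, tz: timezone) -> int:
       # Midnight of the parsed date in the given zone, expressed as
       # microseconds since the Unix epoch -- a stand-in for daysToMicros.
       midnight = datetime.strptime(date_str, fmt).replace(tzinfo=tz)
       return int(midnight.timestamp() * 1_000_000)

   pst = timezone(timedelta(hours=-8))  # the "-08:00" zone from the test
   utc = timezone.utc

   # Midnight PST sits 8 hours later on the UTC timeline than midnight UTC,
   # which is why the TimestampType and TimestampNTZType expectations differ.
   delta = days_to_micros("08_09_2001", "%d_%m_%Y", pst) \
         - days_to_micros("08_09_2001", "%d_%m_%Y", utc)
   print(delta // 3_600_000_000)  # 8
   ```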







[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r903189990


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala:
##########
@@ -358,4 +358,19 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper {
       Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, "UTC")
     check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
   }
+
+  test("dates should be parsed correctly in a timestamp column") {
+    def checkDate(dataType: DataType): Unit = {
+      val timestampsOptions =
+        new CSVOptions(Map("timestampFormat" -> "dd/MM/yyyy HH:mm",
+          "timestampNTZFormat" -> "dd-MM-yyyy HH:mm", "dateFormat" -> "dd_MM_yyyy"),

Review Comment:
   I addressed inference mistakes in [the following code snippet and comment](https://github.com/apache/spark/pull/36871#discussion_r903186613)





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r903255239


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -150,6 +151,7 @@ class UnivocityParser(
 
   private val decimalParser = ExprUtils.getDecimalParser(options.locale)
 
+

Review Comment:
   Let's remove this





[GitHub] [spark] bersprockets commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
bersprockets commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r906408572


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +199,50 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+              // There may be date type entries in timestamp column due to schema inference
+              if (options.inferDate) {
+                daysToMicros(dateFormatter.parse(datum), options.zoneId)

Review Comment:
   >I think you could still make it work, but you would need a new extension of LegacySimpleDateFormatter 
   
   By the way, to avoid confusion, I meant the above in the context of inferring dates when using the legacy parser (I realize now that this discussion is happening in reference to code changes in `UnivocityParser`).
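   The fallback chain in the diff above (timestamp pattern first, then the date pattern when `inferDate` is enabled) can be sketched as follows. This is a hypothetical Python model with stand-in format strings, not Spark's actual formatters:

   ```python
   from datetime import datetime, timezone

   TS_FMT = "%d/%m/%Y %H:%M"  # stand-in for timestampFormat
   DATE_FMT = "%d_%m_%Y"      # stand-in for dateFormat

   def parse_timestamp_cell(datum: str, infer_date: bool = True) -> datetime:
       """Model of the TimestampType converter: try the timestamp pattern,
       and only on failure (and only when inferDate is on) fall back to the
       date pattern, widening the date to midnight."""
       try:
           return datetime.strptime(datum, TS_FMT).replace(tzinfo=timezone.utc)
       except ValueError:
           if not infer_date:
               raise  # surface the original parse error, as the Scala code does
           # A date-typed entry inside a timestamp column.
           return datetime.strptime(datum, DATE_FMT).replace(tzinfo=timezone.utc)

   print(parse_timestamp_cell("08/09/2001 10:30").isoformat())
   print(parse_timestamp_cell("08_09_2001").isoformat())
   ```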







[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r918142012


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -117,7 +123,10 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
         case LongType => tryParseLong(field)
         case _: DecimalType => tryParseDecimal(field)
         case DoubleType => tryParseDouble(field)
+        case DateType => tryParseDateTime(field)
+        case TimestampNTZType if options.inferDate => tryParseDateTime(field)

Review Comment:
   Our expected behavior is that in a column with `TimestampType` entries and then `DateType` entries, the column will be inferred as `TimestampType`. 
   
   Here, `tryParseTimestampNTZ` and `tryParseTimestamp` will not be able to parse the `DateType` entries that show up later in the column and the column will be promoted to string type. So we must use `tryParseDateTime` which will give a chance for the date to be parsed. 





[GitHub] [spark] cloud-fan commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1191234879

   thanks, merging to master!




[GitHub] [spark] MaxGekk commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r927447262


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +198,46 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {

Review Comment:
   Just wondering whether all the issues mentioned by @HyukjinKwon in my PR https://github.com/apache/spark/pull/23202#issuecomment-447715090 are addressed by this.





[GitHub] [spark] HyukjinKwon commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1193431208

   @sadikovi mind opening a pr?




[GitHub] [spark] sadikovi commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1193411546

   @Jonathancui123 I addressed this issue in https://github.com/apache/spark/pull/37147/commits/10ca4a4a09623ef9ea217b0fdf33bbdf2e22c227 on my PR. Can you review? Thanks.




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r898694396


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -206,28 +218,27 @@ class UnivocityParser(
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+              // There may be date type entries in timestamp column due to schema inference
+              convertToDate(datum)

Review Comment:
   Hm, I don't get this case. If the schema is TimestampType, the output here should always be timestamps.





[GitHub] [spark] HyukjinKwon commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1157236737

   cc @bersprockets too if you find some time to review.




[GitHub] [spark] AmplabJenkins commented on pull request #36871: [WIP] infer date type in csv

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1156803934

   Can one of the admins verify this patch?




[GitHub] [spark] bersprockets commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
bersprockets commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r906410657


##########
docs/sql-data-sources-csv.md:
##########
@@ -108,6 +108,12 @@ Data source options of CSV can be set via:
     <td>Infers the input schema automatically from data. It requires one extra pass over the data. CSV built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
+  <tr>
+    <td><code>inferDate</code></td> 
+    <td>false</td>
+    <td>Whether or not to infer columns that satisfy the <code>dateFormat</code> option as <code>Date</code>. Requires <code>inferSchema</code> to be <code>true</code>. When <code>false</code>, columns with dates will be inferred as <code>String</code> (or as <code>Timestamp</code> if it fits the <code>timestampFormat</code>) Legacy date formats in Timestamp columns cannot be parsed with this option.</td>

Review Comment:
   By the way, what does this mean?
   
   >Legacy date formats in Timestamp columns cannot be parsed with this option.





[GitHub] [spark] cloud-fan closed pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
URL: https://github.com/apache/spark/pull/36871




[GitHub] [spark] sadikovi commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1193408598

   I think my main confusion comes from "inferDate=true" turning an invalid timestamp value into a date and then returning it as a timestamp; the column should have been a DateType column.
   
   @Jonathancui123 Would it be possible to revisit this behaviour?
   I am going to fix my PR to work together with the inferDate option, no worries.




[GitHub] [spark] sadikovi commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1192185550

   @cloud-fan @Jonathancui123 Wouldn't this patch cause correctness issues? This is what I found when working on https://github.com/apache/spark/pull/37147:
   
   The "SPARK-39469: Infer schema for date type" test in CSVSuite highlights the issue when run together with my patch, which forbids falling back to the default parser when a timestamp format is provided, since that fallback can lead to correctness issues.
   
   Because "1765-03-28" does not match the timestamp pattern and the column is inferred as TimestampType, it should be returned as `null`. However, in the test it is returned as `1765-03-28 00:00:00.0`. This is, at the very least, confusing. Can someone take a look and clarify?




[GitHub] [spark] sadikovi commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r928322559


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +198,46 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {

Review Comment:
   Agreed. We should address the order. Otherwise, it is unclear how to handle fallback. Fixed here: https://github.com/apache/spark/pull/37147/commits/10ca4a4a09623ef9ea217b0fdf33bbdf2e22c227.





[GitHub] [spark] sadikovi commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1193435389

   It is a small change, so I fixed it in my PR https://github.com/apache/spark/pull/37147.




[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r908942583


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +199,50 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+              // There may be date type entries in timestamp column due to schema inference
+              if (options.inferDate) {
+                daysToMicros(dateFormatter.parse(datum), options.zoneId)

Review Comment:
   Thanks Bruce! This is great context! This will definitely be necessary if we want to support inference along with legacy date formats. Users on legacy dates will be unaffected by this change, so how about we open another ticket for date inference with legacy formats if the demand exists (and merge this PR without legacy date inference support)?





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r909035305


##########
docs/sql-data-sources-csv.md:
##########
@@ -108,6 +108,12 @@ Data source options of CSV can be set via:
     <td>Infers the input schema automatically from data. It requires one extra pass over the data. CSV built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
+  <tr>
+    <td><code>inferDate</code></td> 
+    <td>false</td>
+    <td>Whether or not to infer columns that satisfy the <code>dateFormat</code> option as <code>Date</code>. Requires <code>inferSchema</code> to be <code>true</code>. When <code>false</code>, columns with dates will be inferred as <code>String</code> (or as <code>Timestamp</code> if it fits the <code>timestampFormat</code>) Legacy date formats in Timestamp columns cannot be parsed with this option.</td>

Review Comment:
   This was in reference to the fact that only the non-legacy date parser is used for parsing in timestamp columns, as explained in [the first comment of this thread](https://github.com/apache/spark/pull/36871/files#r903187641).
   
   I'll remove it. This sentence no longer makes sense since we don't support date inference for legacy date formats.





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r919291362


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters
+   */
+  val inferDate = {
+    val inferDateFlag = getBool("inferDate")
+    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) {

Review Comment:
   In the [most recent commit](https://github.com/apache/spark/pull/36871/commits/e1170d0ee2027d810d2b23243602de147748838b), I implemented the suggestion from @cloud-fan to use the non-legacy parser for inference and to allow `inferDate=true` with `legacyTimeParserPolicy = LEGACY`. What do you think?





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r905350901


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -117,8 +123,19 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
         case LongType => tryParseLong(field)
         case _: DecimalType => tryParseDecimal(field)
         case DoubleType => tryParseDouble(field)
-        case TimestampNTZType => tryParseTimestampNTZ(field)
-        case TimestampType => tryParseTimestamp(field)
+        case DateType => tryParseDateTime(field)

Review Comment:
   Consider a column with a TimestampType entry followed by a DateType entry. We would expect this column to be inferred as a TimestampType column.
   `typeSoFar` will be `Timestamp` when `inferField` is called on the second entry, which is DateType, so we need logic in `inferField` to try to parse the date type even when `typeSoFar` is `Timestamp`.
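   For reference, a minimal sketch of the merge behaviour described here (illustrative Python with hypothetical names and formats; Spark's actual logic is the Scala `inferField`/`compatibleType` code in `CSVInferSchema`):
   
   ```python
   from datetime import datetime
   
   DATE_FMT = "%Y-%m-%d"        # stand-in for the dateFormat option
   TS_FMT = "%Y-%m-%dT%H:%M"    # stand-in for the timestampFormat option
   
   def parses(value, fmt):
       try:
           datetime.strptime(value, fmt)
           return True
       except ValueError:
           return False
   
   def infer_field(type_so_far, field):
       # A date entry seen while the column is already Timestamp keeps the
       # column as Timestamp: the date can still be widened to a timestamp.
       if type_so_far == "timestamp":
           if parses(field, TS_FMT) or parses(field, DATE_FMT):
               return "timestamp"
           return "string"
       if type_so_far == "date":
           if parses(field, DATE_FMT):
               return "date"
           if parses(field, TS_FMT):
               return "timestamp"  # widen a date column to timestamp
           return "string"
       return "string"
   ```
   
   With this sketch, `infer_field("timestamp", "2001-09-08")` stays `"timestamp"`, which is exactly the case discussed above.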
   
   
   





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r906232351


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +199,50 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+              // There may be date type entries in timestamp column due to schema inference
+              if (options.inferDate) {
+                daysToMicros(dateFormatter.parse(datum), options.zoneId)

Review Comment:
   @bersprockets Thanks for the suggestion! Do you know the advantage of allowing the legacy formatter, i.e. what date format can the legacy formatter handle that the current formatter cannot?
   
   I'm wondering whether there will be enough users who want to infer dates in the schema and also use legacy date formats.
   
   cc: @Yaohua628 
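   For context, the fallback under discussion converts a date-only entry found in a TimestampType column into a timestamp at midnight. A rough Python stand-in for `daysToMicros` (illustrative only; Spark's version works in days since the epoch and uses the session time zone from `options.zoneId`):
   
   ```python
   from datetime import datetime, timezone
   
   def date_to_micros(datum, date_fmt="%Y-%m-%d", tz=timezone.utc):
       # Parse the date-only string, take midnight in the given zone,
       # and express it as microseconds since the Unix epoch.
       d = datetime.strptime(datum, date_fmt)
       midnight = d.replace(tzinfo=tz)
       return int(midnight.timestamp() * 1_000_000)
   ```
   
   For example, `date_to_micros("1970-01-02")` yields one day's worth of microseconds past the epoch.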





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r903186613


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -2753,6 +2754,35 @@ abstract class CSVSuite
       }
     }
   }
+
+  test("SPARK-39469: Infer schema for date type") {
+    val options = Map(
+      "header" -> "true",
+      "inferSchema" -> "true",
+      "timestampFormat" -> "yyyy-MM-dd'T'HH:mm",
+      "dateFormat" -> "yyyy-MM-dd",
+      "inferDate" -> "true")
+
+    val results = spark.read
+      .format("csv")
+      .options(options)
+      .load(testFile(dateInferSchemaFile))
+
+    val expectedSchema = StructType(List(StructField("date", DateType),
+      StructField("timestamp-date", TimestampType), StructField("date-timestamp", TimestampType)))
+    assert(results.schema == expectedSchema)
+
+    val expected =
+      Seq(
+        Seq(Date.valueOf("2001-9-8"), Timestamp.valueOf("2014-10-27 18:30:0.0"),
+          Timestamp.valueOf("1765-03-28 00:00:0.0")),
+        Seq(Date.valueOf("1941-1-2"), Timestamp.valueOf("2000-09-14 01:01:0.0"),
+          Timestamp.valueOf("1423-11-12 23:41:0.0")),
+        Seq(Date.valueOf("0293-11-7"), Timestamp.valueOf("1995-06-25 00:00:00.0"),
+          Timestamp.valueOf("2016-01-28 20:00:00.0"))
+      )
+    assert(results.collect().toSeq.map(_.toSeq) == expected)
+  }

Review Comment:
   > One test we might need would be timestampFormat" -> "dd/MM/yyyy HH:mm and dateFormat -> dd/MM/yyyy to make sure timestamps are not parsed as date types without conflicting.
   
   This test uses: 
   ```
         "timestampFormat" -> "yyyy-MM-dd'T'HH:mm",
         "dateFormat" -> "yyyy-MM-dd",
   ```
   
   This e2e test ensures that our DateFormatter uses strict parsing: Timestamp columns will not be inferred as Date columns even when the `dateFormat` is a prefix of the `timestampFormat`.
   @HyukjinKwon @bersprockets





[GitHub] [spark] bersprockets commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
bersprockets commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r899496784


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala:
##########
@@ -358,4 +358,19 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper {
       Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, "UTC")
     check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
   }
+
+  test("dates should be parsed correctly in a timestamp column") {
+    def checkDate(dataType: DataType): Unit = {
+      val timestampsOptions =
+        new CSVOptions(Map("timestampFormat" -> "dd/MM/yyyy HH:mm",
+          "timestampNTZFormat" -> "dd-MM-yyyy HH:mm", "dateFormat" -> "dd_MM_yyyy"),

Review Comment:
   >to make sure timestamps are not parsed as date types without conflicting.
   
   That's actually what happens:
   
   Before this PR:
   ```
   scala> val csvInput = Seq("0,2012-01-01 12:00:00", "1,2021-07-01 15:00:00").toDS()
   csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
   
   scala> val df = spark.read.option("inferSchema", "true").csv(csvInput)
   df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: timestamp]
   
   scala> df.printSchema
   root
    |-- _c0: integer (nullable = true)
    |-- _c1: timestamp (nullable = true)
   
   scala> 
   ```
   After this PR:
   ```
   scala> val csvInput = Seq("0,2012-01-01 12:00:00", "1,2021-07-01 15:00:00").toDS()
   csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
   
   scala> val df = spark.read.option("inferSchema", "true").csv(csvInput)
   df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: date]
   
   scala> df.printSchema
   root
    |-- _c0: integer (nullable = true)
    |-- _c1: date (nullable = true)
   
   scala>
   ```
   It looks like some tests fail too, like `CSVInferSchemaSuite` and `CSVv1Suite`, possibly others (I ran these two suites on my laptop; for some reason, the GitHub Actions didn't run tests for this PR. Maybe @Jonathancui123 needs to turn them on in his fork?).
   
   >We should probably 1. add either SQL configuration or an option e.g., infersDate
   
   I think you would need something like that: when set, the date formatter could use the slower, stricter method of parsing (so "2012-01-01 12:00:00" wouldn't parse as a date).
   
   Edit: To do strict parsing, one might need to use ParsePosition and check that the whole date/time string was consumed. Even after setting lenient=false, `SimpleDateFormat.parse` didn't complain about extra characters that weren't consumed.
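   The strict-consumption check can be illustrated in Python, whose `strptime` already rejects unconsumed trailing characters (unlike a lenient `SimpleDateFormat`); the helper name here is illustrative:
   
   ```python
   from datetime import datetime
   
   def parses_fully(value, fmt):
       # Strict parse: succeeds only if the format consumes the entire string.
       try:
           datetime.strptime(value, fmt)
           return True
       except ValueError:  # e.g. "unconverted data remains: ..."
           return False
   ```
   
   Under this check, a full timestamp string such as "2012-01-01 12:00:00" is rejected by a date-only pattern like "%Y-%m-%d", which is the behaviour the strict date formatter needs.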





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r903256627


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +199,50 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+              // There may be date type entries in timestamp column due to schema inference
+              if (options.inferDate) {
+                daysToMicros(dateFormatter.parse(datum), options.zoneId)
+              } else {
+                throw(e)
+              }
+            }
+        }
+      }
+
+    case _: TimestampNTZType => (d: String) =>
+      nullSafeDatum(d, name, nullable, options) { datum =>
+        try {
+          timestampNTZFormatter.parseWithoutTimeZone(datum, false)
+        } catch {
+          case NonFatal(e) =>

Review Comment:
   
   This should be good enough
   
   ```scala
             case NonFatal(e) if options.inferDate =>
               daysToMicros(dateFormatter.parse(datum), options.zoneId)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
bersprockets commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1158329998

   @Jonathancui123 You probably want to turn on GitHub Actions so tests will run.
   
   From https://spark.apache.org/contributing.html:
   
   >Go to “Actions” tab on your forked repository and enable “Build and test” and “Report test results” workflows




[GitHub] [spark] sadikovi commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1192187659

   Maybe it is just the test, so I can update that in my PR, but I would like to clarify the expected behaviour here.




[GitHub] [spark] HyukjinKwon commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1193016381

   So am I right that Option A considers the type coercion between timestamps and dates, and Option B does not? I personally prefer Option B so we can switch to Option A in the future. Once we pick Option A, it's difficult to go back to Option B.




[GitHub] [spark] HyukjinKwon commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1174596604

   Seems fine in general, but maybe I'm missing some discussions or details here.
   
   @Jonathancui123 I would appreciate it if you could summarize any unaddressed comments or concerns. I am fine given that we disable this by default.




[GitHub] [spark] HyukjinKwon commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1157236219

   Took a cursory look. @MaxGekk do you remember the context here? I remember we didn't merge this change because the legacy fast format parser (Java 8 libraries) did not support exact matching (e.g., "yyyy" parses "2000-10-12" as "2000").
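   The lenient behaviour described above is easy to reproduce with `java.text.SimpleDateFormat`, the class underlying the legacy parser; this is an illustrative standalone sketch, not Spark code:
   
   ```java
   import java.text.ParseException;
   import java.text.SimpleDateFormat;
   import java.util.Calendar;
   
   public class Main {
       // SimpleDateFormat only consumes the matching prefix of the input,
       // so pattern "yyyy" happily parses "2000-10-12" and silently
       // ignores the trailing "-10-12" instead of rejecting it.
       static int lenientYear(String s) {
           try {
               SimpleDateFormat sdf = new SimpleDateFormat("yyyy");
               Calendar cal = Calendar.getInstance();
               cal.setTime(sdf.parse(s));
               return cal.get(Calendar.YEAR);
           } catch (ParseException e) {
               return -1;
           }
       }
   
       public static void main(String[] args) {
           System.out.println(lenientYear("2000-10-12")); // 2000, no error
       }
   }
   ```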




[GitHub] [spark] cloud-fan commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r916522350


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -117,7 +123,10 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
         case LongType => tryParseLong(field)
         case _: DecimalType => tryParseDecimal(field)
         case DoubleType => tryParseDouble(field)
+        case DateType => tryParseDateTime(field)
+        case TimestampNTZType if options.inferDate => tryParseDateTime(field)

Review Comment:
   Shall we call `tryParseTimestampNTZ`? The type so far is timestamp, and inferring a date is useless since we need to promote it to timestamp anyway.





[GitHub] [spark] cloud-fan commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r920177503


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters
+   */
+  val inferDate = {
+    val inferDateFlag = getBool("inferDate")
+    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) {

Review Comment:
   @MaxGekk do we have a public doc for the `legacyTimeParserPolicy` option? Does it clearly define its scope?





[GitHub] [spark] cloud-fan commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r927358619


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +198,46 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {

Review Comment:
   I think the issue here is: if timestamp parsing fails, it may be because the value is a date, or because it is in a legacy timestamp format. We need to define the priority here. Since `inferDate` is opt-in, I think it makes more sense to try parsing as a date first, then the legacy format.
   
   cc @sadikovi 
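   A minimal sketch of that fallback order (illustrative only; `java.time` stands in for Spark's formatters here, and the pattern strings are assumptions):
   
   ```java
   import java.time.LocalDate;
   import java.time.LocalDateTime;
   import java.time.format.DateTimeFormatter;
   import java.time.format.DateTimeParseException;
   
   public class Main {
       static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss");
       static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("yyyy-MM-dd");
   
       // Hypothetical priority for a timestamp column when inferDate is on:
       // the exact timestamp format first, then the date format; only if both
       // fail would the real parser fall through to the legacy path.
       static String classify(String datum) {
           try {
               LocalDateTime.parse(datum, TS);
               return "timestamp";
           } catch (DateTimeParseException e1) {
               try {
                   LocalDate.parse(datum, DATE);
                   return "date";
               } catch (DateTimeParseException e2) {
                   return "legacy-or-error";
               }
           }
       }
   
       public static void main(String[] args) {
           System.out.println(classify("2022/01/02 12:56:33")); // timestamp
           System.out.println(classify("2022-06-22"));          // date
           System.out.println(classify("not a time"));          // legacy-or-error
       }
   }
   ```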





[GitHub] [spark] bersprockets commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
bersprockets commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r905348742


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +199,50 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+              // There may be date type entries in timestamp column due to schema inference
+              if (options.inferDate) {
+                daysToMicros(dateFormatter.parse(datum), options.zoneId)

Review Comment:
   >We do not use the legacy DateFormatter here to avoid parsing timestamps with invalid suffixes.
   
   I think you could still make it work, but you would need a  new extension of `LegacySimpleDateFormatter` (maybe `LegacyStrictSimpleDateFormatter`), with an override like this:
   
   ```
     def parseToDate(s: String): Date = {
       val pp = new ParsePosition(0)
       val res = sdf.parse(s, pp)
       if (s.length != pp.getIndex) {
         throw new RuntimeException(s"$s is not a date")
       }
       res
     }
   ```
   `2001-09-08-randomtext` would not parse, neither would `2022-01-02 12:56:33`, but `2022-01-02` would (assuming a format of `yyyy-MM-dd`).
   
   I assume it would be slow (but I have not tested it).
   
   Maybe not worth the extra code.
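   The strict check above can be demonstrated with plain `SimpleDateFormat` and `ParsePosition`; this is a standalone sketch of the same idea, not the proposed Spark class:
   
   ```java
   import java.text.ParsePosition;
   import java.text.SimpleDateFormat;
   import java.util.Date;
   
   public class Main {
       // Strict variant of SimpleDateFormat parsing: reject the input unless
       // the pattern consumed every character, mirroring parseToDate above.
       static boolean strictParses(String pattern, String s) {
           SimpleDateFormat sdf = new SimpleDateFormat(pattern);
           ParsePosition pp = new ParsePosition(0);
           Date res = sdf.parse(s, pp);
           return res != null && pp.getIndex() == s.length();
       }
   
       public static void main(String[] args) {
           System.out.println(strictParses("yyyy-MM-dd", "2022-01-02"));            // true
           System.out.println(strictParses("yyyy-MM-dd", "2001-09-08-randomtext")); // false
           System.out.println(strictParses("yyyy-MM-dd", "2022-01-02 12:56:33"));   // false
       }
   }
   ```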





[GitHub] [spark] bersprockets commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
bersprockets commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r906513077


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +198,46 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+              // There may be date type entries in timestamp column due to schema inference
+              if (options.inferDate) {
+                daysToMicros(dateFormatter.parse(datum), options.zoneId)
+              } else {
+                throw(e)
+              }
+            }
+        }
+      }
+
+    case _: TimestampNTZType => (d: String) =>
+      nullSafeDatum(d, name, nullable, options) { datum =>
+        try {
+          timestampNTZFormatter.parseWithoutTimeZone(datum, false)
+        } catch {
+          case NonFatal(e) if (options.inferDate) =>
+            daysToMicros(dateFormatter.parse(datum), options.zoneId)

Review Comment:
   I think zoneId should probably be UTC for timestamp_ntz. Otherwise, you end up with oddities like this:
   ```
   scala> sql("set spark.sql.timestampType=TIMESTAMP_NTZ")
   res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
   
   scala> val options = Map(
     "inferSchema" -> "true",
     "timestampFormat" -> "yyyy/MM/dd HH:mm:ss",
     "timestampNTZFormat" -> "yyyy-MM-dd'T'HH:mm:ss",  
     "dateFormat" -> "yyyy-MM-dd",
     "inferDate" -> "true")
   
   options: scala.collection.immutable.Map[String,String] = Map(inferSchema -> true, timestampFormat -> yyyy/MM/dd HH:mm:ss, timestampNTZFormat -> yyyy-MM-dd'T'HH:mm:ss, dateFormat -> yyyy-MM-dd, inferDate -> true)
   
   scala> 
   scala> val csvInput = Seq("2022-01-01T00:00:00", "2022-06-22").toDS()
   csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
   
   scala> val df = spark.read.options(options).csv(csvInput)
   df: org.apache.spark.sql.DataFrame = [_c0: timestamp_ntz]
   
   scala> df.show(false)
   +-------------------+
   |_c0                |
   +-------------------+
   |2022-01-01 00:00:00|
   |2022-06-22 07:00:00|
   +-------------------+
   
   scala> 
   ```
   Note `2022-06-22` becomes `2022-06-22 07:00:00`
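   The 7-hour shift comes from resolving the date's start-of-day in the session zone instead of UTC; a small `java.time` sketch (assuming the run above used an America/Los_Angeles session zone, which is UTC-7 in June):
   
   ```java
   import java.time.LocalDate;
   import java.time.ZoneId;
   import java.time.ZoneOffset;
   
   public class Main {
       // Hours of wall-clock drift when a bare date is promoted to a
       // timestamp via the session zone rather than UTC.
       static long driftHours(LocalDate date, ZoneId sessionZone) {
           long utcMillis = date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
           long zonedMillis = date.atStartOfDay(sessionZone).toInstant().toEpochMilli();
           return (zonedMillis - utcMillis) / 3_600_000L;
       }
   
       public static void main(String[] args) {
           // Matches the "2022-06-22 07:00:00" oddity shown above.
           System.out.println(driftHours(LocalDate.of(2022, 6, 22),
               ZoneId.of("America/Los_Angeles"))); // 7
       }
   }
   ```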





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r906439733


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -23,6 +23,12 @@
     ],
     "sqlState" : "22005"
   },
+  "CANNOT_INFER_DATE" : {
+    "message" : [
+      "Cannot infer date in schema inference when LegacyTimeParserPolicy is \"LEGACY\". Legacy Date formatter does not support strict date format matching which is required to avoid inferring timestamps and other non-date entries to date."
+    ],
+    "sqlState" : "22005"

Review Comment:
   Changed to 22007 for date format error





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r898693212


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala:
##########
@@ -358,4 +358,19 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper {
       Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, "UTC")
     check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
   }
+
+  test("dates should be parsed correctly in a timestamp column") {
+    def checkDate(dataType: DataType): Unit = {
+      val timestampsOptions =
+        new CSVOptions(Map("timestampFormat" -> "dd/MM/yyyy HH:mm",
+          "timestampNTZFormat" -> "dd-MM-yyyy HH:mm", "dateFormat" -> "dd_MM_yyyy"),

Review Comment:
   One test we might need would be:
   
   `timestampFormat -> dd/MM/yyyy HH:mm` and `dateFormat -> dd/MM/yyyy`, to make sure timestamps are not parsed as date types when the two formats overlap.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r898692164


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -169,6 +174,14 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
   private def tryParseDouble(field: String): DataType = {
     if ((allCatch opt field.toDouble).isDefined || isInfOrNan(field)) {
       DoubleType
+    } else {
+      tryParseDateTime(field)
+    }
+  }
+
+  private def tryParseDateTime(field: String): DataType = {
+    if ((allCatch opt dateFormatter.parse(field)).isDefined) {

Review Comment:
   We should probably either 1. add a SQL configuration or an option, e.g., `infersDate` (like `prefersDecimal` in the JSON options), or 2. only infer date types when `dateFormat` is specified. The main idea is to avoid a perf regression in schema inference.
   





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r903184632


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -206,28 +218,27 @@ class UnivocityParser(
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+              // There may be date type entries in timestamp column due to schema inference
+              convertToDate(datum)

Review Comment:
   Consider a column containing a `DateType` entry followed by a `TimestampType` entry. We would expect this column to be inferred as a `TimestampType` column.
   
   Thus, when parsing the column, the timestamp converter will fail on the date entry, so we need to fall back to the date converter. If both converters fail, we throw an error.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r903255163


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -110,15 +116,43 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
   def inferField(typeSoFar: DataType, field: String): DataType = {
     if (field == null || field.isEmpty || field == options.nullValue) {
       typeSoFar
+    } else
+    if (options.inferDate) {
+      val typeElemInfer = typeSoFar match {
+        case NullType => tryParseInteger(field)
+        case IntegerType => tryParseInteger(field)
+        case LongType => tryParseLong(field)
+        case _: DecimalType => tryParseDecimal(field)
+        case DoubleType => tryParseDouble(field)
+        case DateType => tryParseDateTime(field)
+        case TimestampNTZType => tryParseDateTime(field)
+        case TimestampType => tryParseDateTime(field)
+        case BooleanType => tryParseBoolean(field)
+        case StringType => StringType
+        case other: DataType =>
+          throw QueryExecutionErrors.dataTypeUnexpectedError(other)
+      }
+      compatibleType(typeSoFar, typeElemInfer).getOrElse(StringType)
     } else {
       val typeElemInfer = typeSoFar match {
         case NullType => tryParseInteger(field)
         case IntegerType => tryParseInteger(field)
         case LongType => tryParseLong(field)
         case _: DecimalType => tryParseDecimal(field)
         case DoubleType => tryParseDouble(field)
-        case TimestampNTZType => tryParseTimestampNTZ(field)
-        case TimestampType => tryParseTimestamp(field)
+        case DateType => tryParseDateTime(field)
+        case TimestampNTZType =>
+          if (options.inferDate) {

Review Comment:
   Hm, it seems this logic cannot be reached, and it duplicates the case-match above. Can we merge this if-else with the one above?





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r903187641


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +199,50 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+              // There may be date type entries in timestamp column due to schema inference
+              if (options.inferDate) {
+                daysToMicros(dateFormatter.parse(datum), options.zoneId)

Review Comment:
   We do not use the legacy DateFormatter here to avoid parsing timestamps with invalid suffixes. We want to throw an error. 
   
   e.g. The legacy DateFormatter will parse the following string without throwing an error:
   dateFormat: `yyyy-mm-dd`
   string: `2001-09-08-randomtext`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r903186613


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -2753,6 +2754,35 @@ abstract class CSVSuite
       }
     }
   }
+
+  test("SPARK-39469: Infer schema for date type") {
+    val options = Map(
+      "header" -> "true",
+      "inferSchema" -> "true",
+      "timestampFormat" -> "yyyy-MM-dd'T'HH:mm",
+      "dateFormat" -> "yyyy-MM-dd",
+      "inferDate" -> "true")
+
+    val results = spark.read
+      .format("csv")
+      .options(options)
+      .load(testFile(dateInferSchemaFile))
+
+    val expectedSchema = StructType(List(StructField("date", DateType),
+      StructField("timestamp-date", TimestampType), StructField("date-timestamp", TimestampType)))
+    assert(results.schema == expectedSchema)
+
+    val expected =
+      Seq(
+        Seq(Date.valueOf("2001-9-8"), Timestamp.valueOf("2014-10-27 18:30:0.0"),
+          Timestamp.valueOf("1765-03-28 00:00:0.0")),
+        Seq(Date.valueOf("1941-1-2"), Timestamp.valueOf("2000-09-14 01:01:0.0"),
+          Timestamp.valueOf("1423-11-12 23:41:0.0")),
+        Seq(Date.valueOf("0293-11-7"), Timestamp.valueOf("1995-06-25 00:00:00.0"),
+          Timestamp.valueOf("2016-01-28 20:00:00.0"))
+      )
+    assert(results.collect().toSeq.map(_.toSeq) == expected)
+  }

Review Comment:
   > One test we might need would be `timestampFormat -> "dd/MM/yyyy HH:mm"` and `dateFormat -> "dd/MM/yyyy"` to make sure timestamps are not parsed as date types without conflicting.
   
   This test uses: 
   ```
         "timestampFormat" -> "yyyy-MM-dd'T'HH:mm",
         "dateFormat" -> "yyyy-MM-dd",
   ```
   
   This e2e test ensures that our DateFormatter uses strict parsing: we will not infer Timestamp columns as Date columns even when the `dateFormat` is a prefix of the `timestampFormat`.
   
   Thank you for the review! @HyukjinKwon @bersprockets 
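
   (As a standalone illustration — a hypothetical Java sketch, not part of the test suite: strict `java.time` parsing distinguishes a date-only pattern from a timestamp pattern even when one is a textual prefix of the other.)

   ```java
   import java.time.LocalDate;
   import java.time.LocalDateTime;
   import java.time.format.DateTimeFormatter;
   import java.time.format.DateTimeParseException;

   public class PrefixFormats {
       // Returns true when the input fully matches the pattern under strict
       // java.time parsing (withTime selects LocalDateTime vs LocalDate).
       static boolean matches(String s, String pattern, boolean withTime) {
           DateTimeFormatter f = DateTimeFormatter.ofPattern(pattern);
           try {
               if (withTime) {
                   LocalDateTime.parse(s, f);
               } else {
                   LocalDate.parse(s, f);
               }
               return true;
           } catch (DateTimeParseException e) {
               return false;
           }
       }

       public static void main(String[] args) {
           // "yyyy-MM-dd" is a prefix of "yyyy-MM-dd'T'HH:mm", yet strict
           // parsing keeps the two apart in both directions.
           System.out.println(matches("2001-09-08", "yyyy-MM-dd", false));              // true
           System.out.println(matches("2001-09-08", "yyyy-MM-dd'T'HH:mm", true));       // false: input ends early
           System.out.println(matches("2001-09-08T10:30", "yyyy-MM-dd", false));        // false: trailing text
           System.out.println(matches("2001-09-08T10:30", "yyyy-MM-dd'T'HH:mm", true)); // true
       }
   }
   ```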





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r903258619


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +199,50 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+              // There may be date type entries in timestamp column due to schema inference
+              if (options.inferDate) {
+                daysToMicros(dateFormatter.parse(datum), options.zoneId)

Review Comment:
   Yeah, I think it makes sense to throw an exception or to disallow the option when the legacy parser (which doesn't care about suffixes) is used.





[GitHub] [spark] cloud-fan commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r916523583


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters
+   */
+  val inferDate = {
+    val inferDateFlag = getBool("inferDate")
+    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) {

Review Comment:
   instead of failing here, shall we directly use `Iso8601DateFormatter` to infer date type?





[GitHub] [spark] bersprockets commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
bersprockets commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r919399248


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters
+   */
+  val inferDate = {
+    val inferDateFlag = getBool("inferDate")
+    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) {

Review Comment:
   > to always use the non-legacy parser for inference and allowing inferDate=true with legacyTimeParserPolicy = LEGACY. What do you think?
   
   It's a little weird that you can specify `inferDate=true` with `legacyTimeParserPolicy=LEGACY`, yet Spark won't properly infer the type of legacy dates, e.g.:
   ```
   scala> sql("set spark.sql.legacy.timeParserPolicy=legacy")
   res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
   
   scala> // 1500-02-29 is a legacy date
   
   scala> val csvInput = Seq("1425-03-22", "2022-01-01", "1500-02-29").toDS
   csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
   
   scala> // infers as string
   
   scala> spark.read.options(Map("inferSchema" -> "true", "inferDate" -> "true", "dateFormat" -> "yyyy-MM-dd")).csv(csvInput).printSchema
   root
    |-- _c0: string (nullable = true)
   
   
   scala> // if you specify a schema, it parses just fine, although 1500-02-29 becomes 1500-03-01 (expected)
   
   scala> spark.read.schema("dt date").options(Map("dateFormat" -> "yyyy-MM-dd")).csv(csvInput).show(false)
   +----------+
   |dt        |
   +----------+
   |1425-03-22|
   |2022-01-01|
   |1500-03-01|
   +----------+
   
   scala> // remove the legacy date from the input...
   
   scala> val csvInput = Seq("1425-03-22", "2022-01-01").toDS
   csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
   
   scala> // .. and *then* Spark will infer date type
   
   scala> spark.read.options(Map("inferSchema" -> "true", "inferDate" -> "true", "dateFormat" -> "yyyy-MM-dd")).csv(csvInput).printSchema
   root
    |-- _c0: date (nullable = true)
   ```
   If the user doesn't have any legacy dates in their input, it will work just fine, but then I am not sure why the user would be specifying `legacyTimeParserPolicy=LEGACY`.
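
   (Side note — a hypothetical Java sketch, not Spark code, of why `1500-02-29` is legacy-only: `SimpleDateFormat`'s hybrid Julian/Gregorian calendar treats 1500 as a leap year, while `java.time`'s proleptic Gregorian calendar does not.)

   ```java
   import java.text.ParseException;
   import java.text.SimpleDateFormat;
   import java.time.LocalDate;
   import java.time.format.DateTimeParseException;

   public class LegacyDates {
       // SimpleDateFormat uses the hybrid Julian/Gregorian calendar, which is
       // Julian before 1582; in the Julian calendar 1500 is a leap year.
       static boolean hybridAccepts(String s) {
           SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd");
           f.setLenient(false); // even non-lenient, the calendar itself is hybrid
           try {
               f.parse(s);
               return true;
           } catch (ParseException e) {
               return false;
           }
       }

       // java.time uses the proleptic Gregorian calendar, where 1500 is not a
       // leap year, so 1500-02-29 is an invalid date.
       static boolean prolepticAccepts(String s) {
           try {
               LocalDate.parse(s);
               return true;
           } catch (DateTimeParseException e) {
               return false;
           }
       }

       public static void main(String[] args) {
           System.out.println(hybridAccepts("1500-02-29"));    // true
           System.out.println(prolepticAccepts("1500-02-29")); // false
       }
   }
   ```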





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r924026412


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters
+   */
+  val inferDate = {
+    val inferDateFlag = getBool("inferDate")
+    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) {

Review Comment:
   @HyukjinKwon Do you have a suggestion for whether we can merge with this latest change or if we should roll it back?





[GitHub] [spark] MaxGekk commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r927447262


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -197,34 +198,46 @@ class UnivocityParser(
         Decimal(decimalParser(datum), dt.precision, dt.scale)
       }
 
-    case _: TimestampType => (d: String) =>
+    case _: DateType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          timestampFormatter.parse(datum)
+          dateFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
+            DateTimeUtils.stringToDate(str).getOrElse(throw e)
         }
       }
 
-    case _: TimestampNTZType => (d: String) =>
-      nullSafeDatum(d, name, nullable, options) { datum =>
-        timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-      }
-
-    case _: DateType => (d: String) =>
+    case _: TimestampType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
         try {
-          dateFormatter.parse(datum)
+          timestampFormatter.parse(datum)
         } catch {
           case NonFatal(e) =>
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
             // compatibility.
             val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-            DateTimeUtils.stringToDate(str).getOrElse(throw e)
+            DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {

Review Comment:
   Just wondering: have all of the issues mentioned by @HyukjinKwon in my PR https://github.com/apache/spark/pull/23202#issuecomment-447715090 been addressed by this PR?





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r898692835


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala:
##########
@@ -358,4 +358,19 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper {
       Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, "UTC")
     check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
   }
+
+  test("dates should be parsed correctly in a timestamp column") {
+    def checkDate(dataType: DataType): Unit = {
+      val timestampsOptions =
+        new CSVOptions(Map("timestampFormat" -> "dd/MM/yyyy HH:mm",
+          "timestampNTZFormat" -> "dd-MM-yyyy HH:mm", "dateFormat" -> "dd_MM_yyyy"),
+          false, "UTC")
+      val dateString = "08_09_2001"
+      val date = days(2001, 9, 8)
+      val parser = new UnivocityParser(new StructType(), timestampsOptions)
+      assert(parser.makeConverter("d", dataType).apply(dateString) == date)
+    }
+    checkDate(TimestampType)
+    checkDate(TimestampNTZType)

Review Comment:
   Should we test DateType here too?





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r898692373


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala:
##########
@@ -358,4 +358,19 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper {
       Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, "UTC")
     check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
   }
+
+  test("dates should be parsed correctly in a timestamp column") {

Review Comment:
   Let's add a prefix "SPARK-39469: ...", see also https://spark.apache.org/contributing.html





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r909035305


##########
docs/sql-data-sources-csv.md:
##########
@@ -108,6 +108,12 @@ Data source options of CSV can be set via:
     <td>Infers the input schema automatically from data. It requires one extra pass over the data. CSV built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
+  <tr>
+    <td><code>inferDate</code></td> 
+    <td>false</td>
+    <td>Whether or not to infer columns that satisfy the <code>dateFormat</code> option as <code>Date</code>. Requires <code>inferSchema</code> to be <code>true</code>. When <code>false</code>, columns with dates will be inferred as <code>String</code> (or as <code>Timestamp</code> if they fit the <code>timestampFormat</code>). Legacy date formats in Timestamp columns cannot be parsed with this option.</td>

Review Comment:
   This was in reference to the fact that only the non-legacy date parser is used for parsing in timestamp columns - as explained in [the first comment of this thread](https://github.com/apache/spark/pull/36871/files#r903187641).
   
   That sentence no longer makes sense since we don't support date inference for legacy date formats. 





[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r909085510


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala:
##########
@@ -358,4 +359,26 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper {
       Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, "UTC")
     check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
   }
+
+  test("SPARK-39469: dates should be parsed correctly in a timestamp column when inferDate=true") {
+    def checkDate(dataType: DataType): Unit = {
+      val timestampsOptions =
+        new CSVOptions(Map("inferDate" -> "true", "timestampFormat" -> "dd/MM/yyyy HH:mm",
+          "timestampNTZFormat" -> "dd-MM-yyyy HH:mm", "dateFormat" -> "dd_MM_yyyy"),
+          false, DateTimeUtils.getZoneId("-08:00").toString)
+      // Use CSVOption ZoneId="-08:00" (PST) to test that Dates in TimestampNTZ column are always
+      // converted to their equivalent UTC timestamp
+      val dateString = "08_09_2001"
+      val expected = dataType match {
+        case TimestampType => date(2001, 9, 8, 0, 0, 0, 0, ZoneOffset.of("-08:00"))
+        case TimestampNTZType => date(2001, 9, 8, 0, 0, 0, 0, ZoneOffset.UTC)

Review Comment:
   > I think zoneId should probably be UTC for timestamp_ntz. Otherwise, you end up with oddities like this...
   
   @bersprockets I've modified this test to have the user in PST and check that the parsed date is converted to a timestamp in UTC. This checks for the error you caught in your [previous comment](https://github.com/apache/spark/pull/36871#discussion_r906513077). Thanks!





[GitHub] [spark] HyukjinKwon commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1162607022

   The logic seems to make sense in general, but it would be best to have a second look from @bersprockets @MaxGekk @gengliangwang.




[GitHub] [spark] gengliangwang commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r904619384


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##########
@@ -529,6 +529,15 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
        """.stripMargin)
   }
 
+  def inferDateWithLegacyTimeParserError(): Throwable = {
+    new IllegalArgumentException(

Review Comment:
   Yes, please use `SparkIllegalArgumentException`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r904450412


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##########
@@ -529,6 +529,15 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
        """.stripMargin)
   }
 
+  def inferDateWithLegacyTimeParserError(): Throwable = {
+    new IllegalArgumentException(

Review Comment:
   For user-facing errors, it should inherit `SparkThrowable`. cc @gengliangwang @MaxGekk @cloud-fan fyi





[GitHub] [spark] Jonathancui123 commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

Posted by GitBox <gi...@apache.org>.
Jonathancui123 commented on PR #36871:
URL: https://github.com/apache/spark/pull/36871#issuecomment-1172702489

   I've updated the PR description to include benchmarking results: `inferDate`=`true` noticeably degrades parsing performance for large CSVs with many timestamp columns.

