
Posted to reviews@spark.apache.org by "MaxGekk (via GitHub)" <gi...@apache.org> on 2024/01/18 16:33:38 UTC

[PR] [WIP][SQL] Don't use the NTZ parser for inferring TIMESTAMP_LTZ in CSV [spark]

MaxGekk opened a new pull request, #44789:
URL: https://github.com/apache/spark/pull/44789

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46769][SQL] Fix type inferring for timestamps without time zone in JSON/CSV [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44789:
URL: https://github.com/apache/spark/pull/44789#discussion_r1458649995


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -202,11 +202,8 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     // We can only parse the value as TimestampNTZType if it does not have zone-offset or
     // time-zone component and can be parsed with the timestamp formatter.
     // Otherwise, it is likely to be a timestamp with timezone.
-    val timestampType = SQLConf.get.timestampType
-    if ((SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY ||
-        timestampType == TimestampNTZType) &&
-        timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) {
-      timestampType
+    if (timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) {
+      SQLConf.get.timestampType

Review Comment:
   I think it's literally wrong to infer a value as LTZ type by using the NTZ parser.




Re: [PR] [SPARK-46769][SQL] Fix type inferring for timestamps without time zone in JSON/CSV [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44789:
URL: https://github.com/apache/spark/pull/44789#discussion_r1458636982


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -202,11 +202,8 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     // We can only parse the value as TimestampNTZType if it does not have zone-offset or
     // time-zone component and can be parsed with the timestamp formatter.
     // Otherwise, it is likely to be a timestamp with timezone.
-    val timestampType = SQLConf.get.timestampType
-    if ((SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY ||
-        timestampType == TimestampNTZType) &&
-        timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) {
-      timestampType
+    if (timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) {
+      SQLConf.get.timestampType

Review Comment:
   Does this assume the LTZ parser can parse NTZ values? That isn't true if LTZ and NTZ have different parsing patterns.
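
The concern about differing patterns can be illustrated with a minimal sketch. It uses plain `java.time` formatters rather than Spark's own timestamp formatters, and the two patterns, the object name `PatternMismatchSketch`, and the helper `parses` are assumptions made for illustration only. It shows that a value accepted by one pattern is not necessarily accepted by the other.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object PatternMismatchSketch {
  // Hypothetical stand-ins for the timestampNTZFormat and timestampFormat options.
  val ntzPattern: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  val ltzPattern: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")

  // True when the whole string parses as a local date-time under the given pattern.
  def parses(fmt: DateTimeFormatter, s: String): Boolean =
    Try(LocalDateTime.parse(s, fmt)).isSuccess

  def main(args: Array[String]): Unit = {
    val field = "2024-01-18 16:33:38"
    // The field fits the space-separated pattern but not the 'T'-separated one.
    println(s"NTZ pattern accepts: ${parses(ntzPattern, field)}")
    println(s"LTZ pattern accepts: ${parses(ltzPattern, field)}")
  }
}
```

Under these assumed patterns, a type chosen after a successful NTZ parse could later be read with a formatter that rejects the same value.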





Re: [PR] [SPARK-46769][SQL] Fix inferring TIMESTAMP_NTZ in JSON/CSV [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on PR #44789:
URL: https://github.com/apache/spark/pull/44789#issuecomment-1899827649

   also cc @Hisoka-X @sadikovi 



Re: [PR] [SPARK-46769][SQL] Fix type inferring for timestamps without time zone in JSON/CSV [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan closed pull request #44789: [SPARK-46769][SQL] Fix type inferring for timestamps without time zone in JSON/CSV
URL: https://github.com/apache/spark/pull/44789



Re: [PR] [SPARK-46769][SQL] Fix inferring TIMESTAMP_NTZ in JSON/CSV [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #44789:
URL: https://github.com/apache/spark/pull/44789#discussion_r1458402959


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -202,11 +202,8 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     // We can only parse the value as TimestampNTZType if it does not have zone-offset or
     // time-zone component and can be parsed with the timestamp formatter.
     // Otherwise, it is likely to be a timestamp with timezone.
-    val timestampType = SQLConf.get.timestampType
-    if ((SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY ||
-        timestampType == TimestampNTZType) &&
-        timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) {
-      timestampType
+    if (timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) {
+      SQLConf.get.timestampType

Review Comment:
   Restored to the state of https://github.com/apache/spark/pull/40022




Re: [PR] [SPARK-46769][SQL] Fix inferring TIMESTAMP_NTZ in JSON/CSV [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #44789:
URL: https://github.com/apache/spark/pull/44789#discussion_r1458404759


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala:
##########
@@ -267,7 +267,9 @@ class CSVInferSchemaSuite extends SparkFunSuite with SQLHelper {
   test("SPARK-45433: inferring the schema when timestamps do not match specified timestampFormat" +
     " with only one row") {
     val options = new CSVOptions(
-      Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss"),
+      Map(
+        "timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss",
+        "timestampNTZFormat" -> "yyyy-MM-dd HH:mm:ss"),

Review Comment:
   To infer the STRING type, the input must match **neither** format: `TIMESTAMP` nor `TIMESTAMP_NTZ`. Setting only `timestampFormat` is not enough.
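
A rough sketch of that fallback rule, assuming a simplified inference order (NTZ pattern first, then the LTZ pattern, then STRING). This is not Spark's `CSVInferSchema`; the object `InferFallbackSketch`, the `Inferred` type, and the helpers `matches` and `infer` are hypothetical, and plain `java.time` parsing stands in for Spark's formatters.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object InferFallbackSketch {
  // Hypothetical result type; Spark's real DataType hierarchy is not used here.
  sealed trait Inferred
  case object TimestampNTZ extends Inferred
  case object Timestamp extends Inferred
  case object StringType extends Inferred

  // True when the whole field parses as a local date-time under the pattern.
  def matches(pattern: String, field: String): Boolean =
    Try(LocalDateTime.parse(field, DateTimeFormatter.ofPattern(pattern))).isSuccess

  // STRING is chosen only when the field matches neither pattern.
  def infer(field: String, ntzPattern: String, ltzPattern: String): Inferred =
    if (matches(ntzPattern, field)) TimestampNTZ
    else if (matches(ltzPattern, field)) Timestamp
    else StringType

  def main(args: Array[String]): Unit = {
    val field = "2024-01-18T16:33:38.123" // illustrative input with a fractional second
    // Both patterns from the test are set: neither accepts the fraction.
    println(infer(field, "yyyy-MM-dd HH:mm:ss", "yyyy-MM-dd'T'HH:mm:ss"))
    // A broader NTZ pattern (an assumed stand-in for an unset option) accepts it.
    println(infer(field, "yyyy-MM-dd'T'HH:mm:ss.SSS", "yyyy-MM-dd'T'HH:mm:ss"))
  }
}
```

In this sketch, narrowing only the LTZ pattern leaves the outcome dependent on whatever the NTZ pattern happens to accept, which is the situation the comment describes.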




Re: [PR] [SPARK-46769][SQL] Fix type inferring for timestamps without time zone in JSON/CSV [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44789:
URL: https://github.com/apache/spark/pull/44789#discussion_r1458648993


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala:
##########
@@ -267,7 +267,9 @@ class CSVInferSchemaSuite extends SparkFunSuite with SQLHelper {
   test("SPARK-45433: inferring the schema when timestamps do not match specified timestampFormat" +
     " with only one row") {
     val options = new CSVOptions(
-      Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss"),
+      Map(
+        "timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss",
+        "timestampNTZFormat" -> "yyyy-MM-dd HH:mm:ss"),

Review Comment:
   I don't agree. Having a query fail because an option is not set is not really good behavior.


