Posted to reviews@spark.apache.org by "gengliangwang (via GitHub)" <gi...@apache.org> on 2024/03/20 21:54:14 UTC

[PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

gengliangwang opened a new pull request, #45621:
URL: https://github.com/apache/spark/pull/45621

   ### What changes were proposed in this pull request?
   Since Spark 4.0, the SQL config `spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. Consequently, when reading Parquet files that were not produced by Spark, the Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ data type. This change ensures backward compatibility with releases of Spark version 3.2 and earlier. It also aligns the behavior of schema inference for Parquet files with that of other data sources such as CSV, JSON, ORC, and JDBC, enhancing consistency across the data sources. To revert to the previous behavior where TIMESTAMP_NTZ types were inferred, set `spark.sql.parquet.inferTimestampNTZ.enabled` to true.
   
   Note: With https://issues.apache.org/jira/browse/SPARK-47368 and https://issues.apache.org/jira/browse/SPARK-47447, this behavior change won't break existing workloads running on Spark 3.3/3.4/3.5.
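   
   For illustration, a minimal sketch of restoring the previous inference behavior (the file path is hypothetical):
   ```
   // Hypothetical sketch: re-enable TIMESTAMP_NTZ inference for Parquet files
   // written outside Spark (timestamp columns annotated isAdjustedToUTC = false).
   spark.conf.set("spark.sql.parquet.inferTimestampNTZ.enabled", "true")
   val df = spark.read.parquet("/path/to/external_parquet")
   df.printSchema() // such columns are inferred as timestamp_ntz again
   ```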
   
   
   ### Why are the changes needed?
   1. Consistency with the behavior of CSV, JSON, ORC, and JDBC data sources. This also makes the schema inference behavior simpler.
   2. When using `insert overwrite` or `replace table` to write external Parquet files into a Delta or Iceberg table, the resulting schema changes from TimestampType to TimestampNTZType. The resulting table can no longer be read by older engine versions, which lack TimestampNTZ support. The behavior change happens silently, and it is hard for users to figure out how to fix it (see the sketch below).
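   
   For illustration, a minimal sketch of the pattern in point 2 (the table and path names are hypothetical, and the Delta connector is assumed to be available):
   ```
   // Hypothetical sketch: on Spark 3.3/3.4/3.5 with inference enabled, timestamps
   // annotated isAdjustedToUTC = false in the external files are inferred as
   // TIMESTAMP_NTZ, so the overwrite silently changes the target column type.
   spark.read.parquet("/path/to/external_parquet")
     .write.mode("overwrite").format("delta").saveAsTable("target_table")
   ```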
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, since Spark 4.0, the SQL config `spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. Consequently, when reading Parquet files that were not produced by Spark, the Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ data type.
   
   With https://issues.apache.org/jira/browse/SPARK-47368 and https://issues.apache.org/jira/browse/SPARK-47447, this behavior change won't break existing workloads running on Spark 3.3/3.4/3.5.
   
   ### How was this patch tested?
   Existing UTs
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Yes, there are some doc suggestions from Copilot in `docs/sql-migration-guide.md`.


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010991405

   I should add that we have custom readers and writers for Parquet, so this parameter doesn't affect us.


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1533361364


##########
docs/sql-migration-guide.md:
##########
@@ -42,6 +42,7 @@ license: |
 - Since Spark 4.0, the function `to_csv` no longer supports input with the data type `STRUCT`, `ARRAY`, `MAP`, `VARIANT` and `BINARY` (because the `CSV specification` does not have standards for these data types and cannot be read back using `from_csv`), Spark will throw `DATATYPE_MISMATCH.UNSUPPORTED_INPUT_TYPE` exception.
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert Postgres TIMESTAMP WITH TIME ZONE and TIME WITH TIME ZONE data types to TimestampNTZType, which is available in Spark 3.5. 
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert MySQL TIMESTAMP to TimestampNTZType, which is available in Spark 3.5. MySQL DATETIME is not affected.
+- Since Spark 4.0, the SQL config `spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. Consequently, when reading Parquet files that were not produced by Spark, the Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ data type. This change ensures backward compatibility with releases of Spark version 3.2 and earlier. It also aligns the behavior of schema inference for Parquet files with that of other data sources such as CSV, JSON, ORC, and JDBC, enhancing consistency across the data sources. To revert to the previous behavior where TIMESTAMP_NTZ types were inferred, set `spark.sql.parquet.inferTimestampNTZ.enabled` to true.

Review Comment:
   Just my two cents. Before reading this PR just now, I actually didn't know we're inferring `TIMESTAMP_NTZ` by default (although it's only when reading Parquet files not written by Spark). Unless we want to do this across the board (e.g., switching Timestamp to TimestampNTZ everywhere), I think we should not have enabled this by default.



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010725475

   To @gengliangwang, I believe it would be better if you could send a short email to dev@spark so as not to surprise anyone.
   
   > I have been considering this for a while because of reported Spark upgrade failures. I am open to any suggestions. cc @sadikovi as well.
   
   


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1532935416


##########
docs/sql-migration-guide.md:
##########
@@ -121,6 +122,8 @@ license: |
   - Since Spark 3.3, the `unbase64` function throws error for a malformed `str` input. Use `try_to_binary(<str>, 'base64')` to tolerate malformed input and return NULL instead. In Spark 3.2 and earlier, the `unbase64` function returns a best-efforts result for a malformed `str` input.
 
   - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS (b)`-style SQL statements, `grouping__id` returns different values from Apache Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by expressions plus grouping set columns. To restore the behavior before 3.3.1 and 3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) and [SPARK-40562](https://issues.apache.org/jira/browse/SPARK-40562).
+    
+  - In Spark 3.3/3.4/3.5 releases, when reading Parquet files that were not produced by Spark, Parquet timestamp columns with annotation `isAdjustedToUTC = false` are inferred as TIMESTAMP_NTZ type during schema inference. In Spark 3.2 and earlier, these columns are inferred as TIMESTAMP type. To restore the behavior before Spark 3.3, you can set `spark.sql.parquet.inferTimestampNTZ.enabled` to `false`. Note that this is a behavior change, and it will be disabled after Spark 4.0 release. 

Review Comment:
   Well, could you handle this as an independent JIRA first, @gengliangwang ?



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1532937164


##########
docs/sql-migration-guide.md:
##########
@@ -42,6 +42,7 @@ license: |
 - Since Spark 4.0, the function `to_csv` no longer supports input with the data type `STRUCT`, `ARRAY`, `MAP`, `VARIANT` and `BINARY` (because the `CSV specification` does not have standards for these data types and cannot be read back using `from_csv`), Spark will throw `DATATYPE_MISMATCH.UNSUPPORTED_INPUT_TYPE` exception.
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert Postgres TIMESTAMP WITH TIME ZONE and TIME WITH TIME ZONE data types to TimestampNTZType, which is available in Spark 3.5. 
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert MySQL TIMESTAMP to TimestampNTZType, which is available in Spark 3.5. MySQL DATETIME is not affected.
+- Since Spark 4.0, the SQL config `spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. Consequently, when reading Parquet files that were not produced by Spark, the Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ data type. This change ensures backward compatibility with releases of Spark version 3.2 and earlier. It also aligns the behavior of schema inference for Parquet files with that of other data sources such as CSV, JSON, ORC, and JDBC, enhancing consistency across the data sources. To revert to the previous behavior where TIMESTAMP_NTZ types were inferred, set `spark.sql.parquet.inferTimestampNTZ.enabled` to true.

Review Comment:
   We need a discussion about this breaking change in Apache Spark 4.0.0, @gengliangwang .



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1532951714


##########
docs/sql-migration-guide.md:
##########
@@ -121,6 +122,8 @@ license: |
   - Since Spark 3.3, the `unbase64` function throws error for a malformed `str` input. Use `try_to_binary(<str>, 'base64')` to tolerate malformed input and return NULL instead. In Spark 3.2 and earlier, the `unbase64` function returns a best-efforts result for a malformed `str` input.
 
   - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS (b)`-style SQL statements, `grouping__id` returns different values from Apache Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by expressions plus grouping set columns. To restore the behavior before 3.3.1 and 3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) and [SPARK-40562](https://issues.apache.org/jira/browse/SPARK-40562).
+    
+  - In Spark 3.3/3.4/3.5 releases, when reading Parquet files that were not produced by Spark, Parquet timestamp columns with annotation `isAdjustedToUTC = false` are inferred as TIMESTAMP_NTZ type during schema inference. In Spark 3.2 and earlier, these columns are inferred as TIMESTAMP type. To restore the behavior before Spark 3.3, you can set `spark.sql.parquet.inferTimestampNTZ.enabled` to `false`. Note that this is a behavior change, and it will be disabled after Spark 4.0 release. 

Review Comment:
   Sure, created https://github.com/apache/spark/pull/45623 for this change.



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010988889

   @viirya Thanks, could you provide more details about the Iceberg workaround?
   
   As for Delta, users have to either disable `spark.sql.parquet.inferTimestampNTZ.enabled`, or enable the TimestampNTZ feature via
   ```
   ALTER TABLE table_name SET TBLPROPERTIES ('delta.feature.timestampNtz' = 'supported')
   ```
   
   After enabling the feature, the table can't be read by older Delta versions. So disabling `spark.sql.parquet.inferTimestampNTZ.enabled` is a better solution in such a case.
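   
   For example, a minimal sketch of the config-based workaround (the table and path names are hypothetical):
   ```
   // Hypothetical sketch: keep external Parquet timestamps as TIMESTAMP (with
   // time zone) so the Delta table does not require the TimestampNTZ feature.
   spark.conf.set("spark.sql.parquet.inferTimestampNTZ.enabled", "false")
   spark.read.parquet("/path/to/external_parquet")
     .write.mode("overwrite").format("delta").saveAsTable("delta_table")
   ```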


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2011021425

   > @viirya Thanks, could you provide more details about the Iceberg workaround?
   > 
   > As for Delta, users have to either disable `spark.sql.parquet.inferTimestampNTZ.enabled`, or enable the TimestampNTZ feature via
   > 
   > ```
   > ALTER TABLE table_name SET TBLPROPERTIES ('delta.feature.timestampNtz' = 'supported')
   > ```
   > 
   > After enabling the feature, the table can't be read by older Delta versions. So disabling `spark.sql.parquet.inferTimestampNTZ.enabled` is a better solution in such a case.
   
   Not sure if they are definitely related, but I remember there are configs like
   
   ```
   .config("spark.sql.iceberg.use-timestamp-without-timezone-in-new-tables", "true")
   .config("spark.sql.iceberg.handle-timestamp-without-timezone", "true")
   ```


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2011026947

   Yes, those parameters only change the Iceberg schema. What you are seeing is the note that the Iceberg schema cannot be read into the Spark type without a special parameter. Changing the default for the writer still doesn't affect us.
   
   We had that issue previously because Trino and other engines have no problem creating a table with that type.


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010712125

   I have been considering this for a while because of reported Spark upgrade failures. I am open to any suggestions.
   cc @sadikovi as well.
   


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010929308

   @dongjoon-hyun yes, it is a tough decision. I am now marking this PR as WIP and will decide whether we should continue on this.


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2011010486

   @RussellSpitzer The same pattern can happen on Iceberg if there are multiple Spark versions in the pipeline:
   On Spark 3.3+
   ```
   // The following could also be a REPLACE TABLE ... AS SELECT
   spark.read.parquet(...).write.mode("overwrite").format("iceberg").saveAsTable("iceberg_table")
   ```
   
   On Spark 3.2 and earlier:
   ```
   > select * from iceberg_table
   
   java.lang.IllegalArgumentException: Cannot handle timestamp without timezone fields in Spark. Spark does not natively support this type but if you would like to handle all timestamps as timestamp with timezone set 'spark.sql.iceberg.handle-timestamp-without-timezone' to true. This will not change the underlying values stored but will change their displayed values in Spark. For more information please see https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
   ```


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010990874

   I don't think this is an issue for Iceberg; we would always store the values the same way and then mark the type difference in the schema. Readers would then get values based on the table schema, not what was in the file.


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1532932026


##########
docs/sql-migration-guide.md:
##########
@@ -121,6 +122,8 @@ license: |
   - Since Spark 3.3, the `unbase64` function throws error for a malformed `str` input. Use `try_to_binary(<str>, 'base64')` to tolerate malformed input and return NULL instead. In Spark 3.2 and earlier, the `unbase64` function returns a best-efforts result for a malformed `str` input.
 
   - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS (b)`-style SQL statements, `grouping__id` returns different values from Apache Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by expressions plus grouping set columns. To restore the behavior before 3.3.1 and 3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) and [SPARK-40562](https://issues.apache.org/jira/browse/SPARK-40562).
+    
+  - In Spark 3.3/3.4/3.5 releases, when reading Parquet files that were not produced by Spark, Parquet timestamp columns with annotation `isAdjustedToUTC = false` are inferred as TIMESTAMP_NTZ type during schema inference. In Spark 3.2 and earlier, these columns are inferred as TIMESTAMP type. To restore the behavior before Spark 3.3, you can set `spark.sql.parquet.inferTimestampNTZ.enabled` to `false`. Note that this is a behavior change, and it will be disabled after Spark 4.0 release. 

Review Comment:
   Another option is to change the conf in the latest 3.3/3.4/3.5 releases.



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010724234

   I understand the exposed surface is limited as mentioned by @gengliangwang, but any Parquet change has a significant impact on all Spark users because Apache Parquet is the default format.
   
   cc @mridulm , @sunchao , @viirya , @peter-toth , @RussellSpitzer , too.


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010971958

   > The resulting table can no longer be read by older engine versions, which lack TimestampNTZ support. The behavior change happens silently, and it is hard for users to figure out how to fix it.
   
   Hmm, I vaguely recall that in Iceberg there are some configs that can be used to read such fields as a workaround? I'm not sure about Delta on this.


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1534394202


##########
docs/sql-migration-guide.md:
##########
@@ -42,6 +42,7 @@ license: |
 - Since Spark 4.0, the function `to_csv` no longer supports input with the data type `STRUCT`, `ARRAY`, `MAP`, `VARIANT` and `BINARY` (because the `CSV specification` does not have standards for these data types and cannot be read back using `from_csv`), Spark will throw `DATATYPE_MISMATCH.UNSUPPORTED_INPUT_TYPE` exception.
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert Postgres TIMESTAMP WITH TIME ZONE and TIME WITH TIME ZONE data types to TimestampNTZType, which is available in Spark 3.5. 
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert MySQL TIMESTAMP to TimestampNTZType, which is available in Spark 3.5. MySQL DATETIME is not affected.
+- Since Spark 4.0, the SQL config `spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. Consequently, when reading Parquet files that were not produced by Spark, the Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ data type. This change ensures backward compatibility with releases of Spark version 3.2 and earlier. It also aligns the behavior of schema inference for Parquet files with that of other data sources such as CSV, JSON, ORC, and JDBC, enhancing consistency across the data sources. To revert to the previous behavior where TIMESTAMP_NTZ types were inferred, set `spark.sql.parquet.inferTimestampNTZ.enabled` to true.

Review Comment:
   Thanks! I will collect more comments on this PR before I start a thread on the dev list.


