Posted to reviews@spark.apache.org by "gengliangwang (via GitHub)" <gi...@apache.org> on 2024/03/20 21:54:14 UTC

[PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

gengliangwang opened a new pull request, #45621:
URL: https://github.com/apache/spark/pull/45621

   ### What changes were proposed in this pull request?
   Since Spark 4.0, the SQL config `spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. Consequently, when reading Parquet files that were not produced by Spark, the Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ data type. This change ensures backward compatibility with releases of Spark version 3.2 and earlier. It also aligns the behavior of schema inference for Parquet files with that of other data sources such as CSV, JSON, ORC, and JDBC, enhancing consistency across the data sources. To revert to the previous behavior where TIMESTAMP_NTZ types were inferred, set `spark.sql.parquet.inferTimestampNTZ.enabled` to true.
   
   Note: With https://issues.apache.org/jira/browse/SPARK-47368 and https://issues.apache.org/jira/browse/SPARK-47447, this behavior change won't break existing workloads running on Spark 3.3/3.4/3.5.
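   
   For illustration, a minimal sketch of restoring the previous inference behavior (the file path is hypothetical):
   ```
   // Hypothetical sketch: re-enable TIMESTAMP_NTZ inference for Parquet files
   // written outside Spark (timestamp columns annotated isAdjustedToUTC = false).
   spark.conf.set("spark.sql.parquet.inferTimestampNTZ.enabled", "true")
   val df = spark.read.parquet("/path/to/external_parquet")
   df.printSchema() // such columns are inferred as timestamp_ntz again
   ```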
   
   
   ### Why are the changes needed?
   1. Consistency with the behavior of CSV, JSON, ORC, and JDBC data sources. This also makes the schema inference behavior simpler.
   2. When using `insert overwrite` or `replace table` to write external Parquet files into a Delta or Iceberg table, the resulting schema changes from TimestampType to TimestampNTZType. The resulting table can no longer be read by older engine versions, which lack TimestampNTZ support. The behavior change happens silently, and it is hard for users to figure out how to fix it (see the sketch below).
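   
   For illustration, a minimal sketch of the pattern in point 2 (the table and path names are hypothetical, and the Delta connector is assumed to be available):
   ```
   // Hypothetical sketch: on Spark 3.3/3.4/3.5 with inference enabled, timestamps
   // annotated isAdjustedToUTC = false in the external files are inferred as
   // TIMESTAMP_NTZ, so the overwrite silently changes the target column type.
   spark.read.parquet("/path/to/external_parquet")
     .write.mode("overwrite").format("delta").saveAsTable("target_table")
   ```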
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, since Spark 4.0, the SQL config `spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. Consequently, when reading Parquet files that were not produced by Spark, the Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ data type.
   
   With https://issues.apache.org/jira/browse/SPARK-47368 and https://issues.apache.org/jira/browse/SPARK-47447, this behavior change won't break existing workloads running on Spark 3.3/3.4/3.5.
   
   ### How was this patch tested?
   Existing UTs
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Yes, there are some doc suggestions from Copilot in `docs/sql-migration-guide.md`.


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010991405

   I should add that we have custom readers and writers for Parquet, so this parameter doesn't affect us.


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1533361364


##########
docs/sql-migration-guide.md:
##########
@@ -42,6 +42,7 @@ license: |
 - Since Spark 4.0, the function `to_csv` no longer supports input with the data type `STRUCT`, `ARRAY`, `MAP`, `VARIANT` and `BINARY` (because the `CSV specification` does not have standards for these data types and cannot be read back using `from_csv`), Spark will throw `DATATYPE_MISMATCH.UNSUPPORTED_INPUT_TYPE` exception.
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert Postgres TIMESTAMP WITH TIME ZONE and TIME WITH TIME ZONE data types to TimestampNTZType, which is available in Spark 3.5. 
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert MySQL TIMESTAMP to TimestampNTZType, which is available in Spark 3.5. MySQL DATETIME is not affected.
+- Since Spark 4.0, the SQL config `spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. Consequently, when reading Parquet files that were not produced by Spark, the Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ data type. This change ensures backward compatibility with releases of Spark version 3.2 and earlier. It also aligns the behavior of schema inference for Parquet files with that of other data sources such as CSV, JSON, ORC, and JDBC, enhancing consistency across the data sources. To revert to the previous behavior where TIMESTAMP_NTZ types were inferred, set `spark.sql.parquet.inferTimestampNTZ.enabled` to true.

Review Comment:
   Just my two cents. Before reading this PR just now, I actually didn't know we're inferring `TIMESTAMP_NTZ` by default (although it's only when reading Parquet files not written by Spark). Unless we want to do this across the board (e.g., switching Timestamp to TimestampNTZ everywhere), I think we should not have enabled this by default.



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010725475

   To @gengliangwang, I believe it would be better if you could send a short email to dev@spark so as not to surprise anyone.
   
   > I have been considering this for a while because of reported Spark upgrade failures. I am open to any suggestions. cc @sadikovi as well.
   
   


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1532935416


##########
docs/sql-migration-guide.md:
##########
@@ -121,6 +122,8 @@ license: |
   - Since Spark 3.3, the `unbase64` function throws error for a malformed `str` input. Use `try_to_binary(<str>, 'base64')` to tolerate malformed input and return NULL instead. In Spark 3.2 and earlier, the `unbase64` function returns a best-efforts result for a malformed `str` input.
 
   - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS (b)`-style SQL statements, `grouping__id` returns different values from Apache Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by expressions plus grouping set columns. To restore the behavior before 3.3.1 and 3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) and [SPARK-40562](https://issues.apache.org/jira/browse/SPARK-40562).
+    
+  - In Spark 3.3/3.4/3.5 releases, when reading Parquet files that were not produced by Spark, Parquet timestamp columns with annotation `isAdjustedToUTC = false` are inferred as TIMESTAMP_NTZ type during schema inference. In Spark 3.2 and earlier, these columns are inferred as TIMESTAMP type. To restore the behavior before Spark 3.3, you can set `spark.sql.parquet.inferTimestampNTZ.enabled` to `false`. Note that this is a behavior change, and it will be disabled after Spark 4.0 release. 

Review Comment:
   Well, could you handle this as an independent JIRA first, @gengliangwang ?



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1532937164


##########
docs/sql-migration-guide.md:
##########
@@ -42,6 +42,7 @@ license: |
 - Since Spark 4.0, the function `to_csv` no longer supports input with the data type `STRUCT`, `ARRAY`, `MAP`, `VARIANT` and `BINARY` (because the `CSV specification` does not have standards for these data types and cannot be read back using `from_csv`), Spark will throw `DATATYPE_MISMATCH.UNSUPPORTED_INPUT_TYPE` exception.
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert Postgres TIMESTAMP WITH TIME ZONE and TIME WITH TIME ZONE data types to TimestampNTZType, which is available in Spark 3.5. 
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert MySQL TIMESTAMP to TimestampNTZType, which is available in Spark 3.5. MySQL DATETIME is not affected.
+- Since Spark 4.0, the SQL config `spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. Consequently, when reading Parquet files that were not produced by Spark, the Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ data type. This change ensures backward compatibility with releases of Spark version 3.2 and earlier. It also aligns the behavior of schema inference for Parquet files with that of other data sources such as CSV, JSON, ORC, and JDBC, enhancing consistency across the data sources. To revert to the previous behavior where TIMESTAMP_NTZ types were inferred, set `spark.sql.parquet.inferTimestampNTZ.enabled` to true.

Review Comment:
   We need a discussion about this breaking change in Apache Spark 4.0.0, @gengliangwang .



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1532951714


##########
docs/sql-migration-guide.md:
##########
@@ -121,6 +122,8 @@ license: |
   - Since Spark 3.3, the `unbase64` function throws error for a malformed `str` input. Use `try_to_binary(<str>, 'base64')` to tolerate malformed input and return NULL instead. In Spark 3.2 and earlier, the `unbase64` function returns a best-efforts result for a malformed `str` input.
 
   - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS (b)`-style SQL statements, `grouping__id` returns different values from Apache Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by expressions plus grouping set columns. To restore the behavior before 3.3.1 and 3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) and [SPARK-40562](https://issues.apache.org/jira/browse/SPARK-40562).
+    
+  - In Spark 3.3/3.4/3.5 releases, when reading Parquet files that were not produced by Spark, Parquet timestamp columns with annotation `isAdjustedToUTC = false` are inferred as TIMESTAMP_NTZ type during schema inference. In Spark 3.2 and earlier, these columns are inferred as TIMESTAMP type. To restore the behavior before Spark 3.3, you can set `spark.sql.parquet.inferTimestampNTZ.enabled` to `false`. Note that this is a behavior change, and it will be disabled after Spark 4.0 release. 

Review Comment:
   Sure, created https://github.com/apache/spark/pull/45623 for this change.



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010988889

   @viirya Thanks, could you provide more details about the Iceberg workaround?
   
   As for Delta, users have to either disable `spark.sql.parquet.inferTimestampNTZ.enabled`, or enable the TimestampNTZ feature via
   ```
   ALTER TABLE table_name SET TBLPROPERTIES ('delta.feature.timestampNtz' = 'supported')
   ```
   
   After enabling the feature, the table can't be read by older Delta versions. So disabling `spark.sql.parquet.inferTimestampNTZ.enabled` is a better solution in such a case.
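   
   For example, a minimal sketch of the config-based workaround (the table and path names are hypothetical):
   ```
   // Hypothetical sketch: keep external Parquet timestamps as TIMESTAMP (with
   // time zone) so the Delta table does not require the TimestampNTZ feature.
   spark.conf.set("spark.sql.parquet.inferTimestampNTZ.enabled", "false")
   spark.read.parquet("/path/to/external_parquet")
     .write.mode("overwrite").format("delta").saveAsTable("delta_table")
   ```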


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2011021425

   > @viirya Thanks, could you provide more details about the Iceberg workaround?
   > 
   > As for Delta, users have to either disable `spark.sql.parquet.inferTimestampNTZ.enabled`, or enable the TimestampNTZ feature via
   > 
   > ```
   > ALTER TABLE table_name SET TBLPROPERTIES ('delta.feature.timestampNtz' = 'supported')
   > ```
   > 
   > After enabling the feature, the table can't be read by older Delta versions. So disabling `spark.sql.parquet.inferTimestampNTZ.enabled` is a better solution in such a case.
   
   Not sure if they are definitely related, but I remember there are configs like
   
   ```
   .config("spark.sql.iceberg.use-timestamp-without-timezone-in-new-tables", "true")
   .config("spark.sql.iceberg.handle-timestamp-without-timezone", "true")
   ```


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2011026947

   Yes, those parameters only change the Iceberg schema. What you are seeing is the note that the Iceberg schema cannot be read into the Spark type without a special parameter. Changing the default for the writer still doesn't affect us.
   
   We had that issue previously because Trino and other engines have no problem creating a table with that type.


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010712125

   I have been considering this for a while because of reported Spark upgrade failures. I am open to any suggestions.
   cc @sadikovi as well.
   


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010929308

   @dongjoon-hyun yes, it is a tough decision. I am now marking this PR as WIP and will decide whether we should continue on this.


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2011010486

   @RussellSpitzer The same pattern can happen on Iceberg if there are multiple Spark versions in the pipeline:
   On Spark 3.3+
   ```
   // The following could also be a REPLACE TABLE ... AS SELECT
   spark.read.parquet(...).write.mode("overwrite").format("iceberg").saveAsTable("iceberg_table")
   ```
   
   On Spark 3.2 and earlier:
   ```
   > select * from iceberg_table
   
   java.lang.IllegalArgumentException: Cannot handle timestamp without timezone fields in Spark. Spark does not natively support this type but if you would like to handle all timestamps as timestamp with timezone set 'spark.sql.iceberg.handle-timestamp-without-timezone' to true. This will not change the underlying values stored but will change their displayed values in Spark. For more information please see https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
   ```


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010990874

   I don't think this is an issue for Iceberg; we would always store the values the same way and then mark the type difference in the schema. Readers would then get values based on the table schema, not what was in the file.


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1532932026


##########
docs/sql-migration-guide.md:
##########
@@ -121,6 +122,8 @@ license: |
   - Since Spark 3.3, the `unbase64` function throws error for a malformed `str` input. Use `try_to_binary(<str>, 'base64')` to tolerate malformed input and return NULL instead. In Spark 3.2 and earlier, the `unbase64` function returns a best-efforts result for a malformed `str` input.
 
   - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS (b)`-style SQL statements, `grouping__id` returns different values from Apache Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by expressions plus grouping set columns. To restore the behavior before 3.3.1 and 3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) and [SPARK-40562](https://issues.apache.org/jira/browse/SPARK-40562).
+    
+  - In Spark 3.3/3.4/3.5 releases, when reading Parquet files that were not produced by Spark, Parquet timestamp columns with annotation `isAdjustedToUTC = false` are inferred as TIMESTAMP_NTZ type during schema inference. In Spark 3.2 and earlier, these columns are inferred as TIMESTAMP type. To restore the behavior before Spark 3.3, you can set `spark.sql.parquet.inferTimestampNTZ.enabled` to `false`. Note that this is a behavior change, and it will be disabled after Spark 4.0 release. 

Review Comment:
   Another option is to change the conf in the latest 3.3/3.4/3.5 releases.



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010724234

   I understand the exposed surface is limited as mentioned by @gengliangwang, but any Parquet change has a significant impact on all Spark users because Apache Parquet is the default format.
   
   cc @mridulm , @sunchao , @viirya , @peter-toth , @RussellSpitzer , too.


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010971958

   > The resulting table can no longer be read by older engine versions, which lack TimestampNTZ support. The behavior change happens silently, and it is hard for users to figure out how to fix it.
   
   Hmm, I vaguely recall that in Iceberg there are some configs that can be used to read such fields as a workaround? I'm not sure about Delta on this.


Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1534394202


##########
docs/sql-migration-guide.md:
##########
@@ -42,6 +42,7 @@ license: |
 - Since Spark 4.0, the function `to_csv` no longer supports input with the data type `STRUCT`, `ARRAY`, `MAP`, `VARIANT` and `BINARY` (because the `CSV specification` does not have standards for these data types and cannot be read back using `from_csv`), Spark will throw `DATATYPE_MISMATCH.UNSUPPORTED_INPUT_TYPE` exception.
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert Postgres TIMESTAMP WITH TIME ZONE and TIME WITH TIME ZONE data types to TimestampNTZType, which is available in Spark 3.5. 
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert MySQL TIMESTAMP to TimestampNTZType, which is available in Spark 3.5. MySQL DATETIME is not affected.
+- Since Spark 4.0, the SQL config `spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. Consequently, when reading Parquet files that were not produced by Spark, the Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ data type. This change ensures backward compatibility with releases of Spark version 3.2 and earlier. It also aligns the behavior of schema inference for Parquet files with that of other data sources such as CSV, JSON, ORC, and JDBC, enhancing consistency across the data sources. To revert to the previous behavior where TIMESTAMP_NTZ types were inferred, set `spark.sql.parquet.inferTimestampNTZ.enabled` to true.

Review Comment:
   Thanks! I will collect more comments on this PR before I start a thread on the dev list.


