Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/12/28 15:14:55 UTC

[GitHub] [spark] MaxGekk opened a new pull request #35042: [SPARK-37705][SQL][3.2] Rebase timestamps in the session time zone saved in Parquet/Avro metadata

MaxGekk opened a new pull request #35042:
URL: https://github.com/apache/spark/pull/35042


   ### What changes were proposed in this pull request?
   In the PR, I propose to add a new metadata key `org.apache.spark.timeZone` which Spark writes to Parquet/Avro metadata while rebasing datetimes in the `LEGACY` mode (see the SQL configs:
   - `spark.sql.parquet.datetimeRebaseModeInWrite`,
   - `spark.sql.parquet.int96RebaseModeInWrite` and
   - `spark.sql.avro.datetimeRebaseModeInWrite`).
   
   The writers use the current session time zone (see the SQL config `spark.sql.session.timeZone`) when rebasing Parquet/Avro timestamp columns.
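For date-only values the rebase is a pure calendar conversion between the hybrid Julian and the Proleptic Gregorian calendars; the time zone only matters once a time of day is involved, which is why it must be recorded for timestamps. A minimal standalone sketch of the calendar part (not Spark's actual code), using the standard Julian Day Number conversion formulas:

```python
# Sketch of calendar rebasing via Julian Day Numbers (illustration only,
# not Spark's implementation).

def gregorian_jdn(y, m, d):
    # Julian Day Number of a proleptic Gregorian calendar date.
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - yy // 100 + yy // 400 - 32045

def julian_jdn(y, m, d):
    # Julian Day Number of a Julian calendar date.
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - 32083

# The Gregorian reform: Julian 1582-10-04 is immediately followed by
# Gregorian 1582-10-15, so the two labels name adjacent physical days.
assert gregorian_jdn(1582, 10, 15) == julian_jdn(1582, 10, 4) + 1
```

The same physical day before the reform carries different labels in the two calendars, so a days-since-epoch value computed in one calendar must be shifted to represent the same local date in the other.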
   
   On the reader side, Spark tries to retrieve the writer's time zone from the new metadata property:
   ```
   $ java -jar ~parquet-tools-1.12.0.jar meta ./part-00000-b0d90bf0-ce60-4b4f-b453-b33f61ab2b2a-c000.snappy.parquet
   ...
   extra:       org.apache.spark.timeZone = America/Los_Angeles
   extra:       org.apache.spark.legacyDateTime =
   ```
   and uses it when rebasing timestamps to the Proleptic Gregorian calendar. If the reader cannot retrieve the original time zone from the Parquet/Avro metadata, it uses the default JVM time zone for backward compatibility.
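The reader-side fallback described above can be sketched as follows. This is a hypothetical illustration (the function name and metadata dict are invented; only the `org.apache.spark.timeZone` key name comes from this PR), not Spark's actual reader code:

```python
# Hypothetical sketch of the reader-side fallback: prefer the time zone
# recorded at write time, otherwise fall back to the reader's default
# JVM time zone for files written before this change.

def writer_time_zone(file_metadata, jvm_default_zone):
    tz = file_metadata.get("org.apache.spark.timeZone")
    return tz if tz else jvm_default_zone

# File written before this change: no time-zone key in the metadata.
old_meta = {"org.apache.spark.legacyDateTime": ""}
assert writer_time_zone(old_meta, "Europe/Moscow") == "Europe/Moscow"

# File written after this change carries the writer's session time zone.
new_meta = {"org.apache.spark.timeZone": "America/Los_Angeles"}
assert writer_time_zone(new_meta, "Europe/Moscow") == "America/Los_Angeles"
```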
   
   ### Why are the changes needed?
   Before the changes, Spark assumed that the writer used the default JVM time zone while rebasing dates/timestamps. If the reader and the writer had different JVM time zone settings, the reader could not load such columns correctly in the `LEGACY` mode. After the changes, the reader has full info about the writer's settings.
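To see why the writer's zone matters, note that the same local wall-clock time maps to different instants (and so different stored epoch values) in different time zones; a reader that guesses the wrong zone reconstructs the wrong timestamp. A small stdlib-only illustration:

```python
# The same wall-clock time yields different epoch values in different
# time zones, so rebasing must use the zone the writer actually used.
from datetime import datetime
from zoneinfo import ZoneInfo

wall = datetime(2020, 1, 1, 0, 0, 0)
in_la = wall.replace(tzinfo=ZoneInfo("America/Los_Angeles")).timestamp()
in_utc = wall.replace(tzinfo=ZoneInfo("UTC")).timestamp()

# Midnight in Los Angeles (UTC-8 in January) is 8 hours after midnight UTC.
assert in_la - in_utc == 8 * 3600
```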
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. After the changes, Parquet/Avro writers use the session time zone when rebasing timestamps in the `LEGACY` mode instead of the default JVM time zone. Note that the session time zone is set to the JVM time zone by default.
   
   ### How was this patch tested?
   1. By running new tests:
   ```
   $ build/sbt "test:testOnly *ParquetRebaseDatetimeV1Suite"
   $ build/sbt "test:testOnly *ParquetRebaseDatetimeV2Suite"
   $ build/sbt "test:testOnly *AvroV1Suite"
   $ build/sbt "test:testOnly *AvroV2Suite"
   ```
   2. And related existing test suites:
   ```
   $ build/sbt "test:testOnly *DateTimeUtilsSuite"
   $ build/sbt "test:testOnly *RebaseDateTimeSuite"
   $ build/sbt "test:testOnly *TimestampFormatterSuite"
   $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroCatalystDataConversionSuite"
   $ build/sbt "test:testOnly *AvroRowReaderSuite"
   $ build/sbt "test:testOnly *AvroSerdeSuite"
   $ build/sbt "test:testOnly *ParquetVectorizedSuite"
   ```
   
   3. Also modified the test `SPARK-31159: rebasing timestamps in write` to check loading timestamps in the `LEGACY` mode when the session time zone and JVM time zone differ.
   
   4. Generated parquet files with Spark 3.2.0 (commit https://github.com/apache/spark/commit/5d45a415f3a29898d92380380cfd82bfc7f579ea) using the test `"SPARK-31806: generate test files for checking compatibility with Spark 2.4"`. The parquet files don't contain info about the original time zone:
   ```
   $ java -jar ~/Downloads/parquet-tools-1.12.0.jar meta sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v3_2_0.snappy.parquet
   file:        file:/Users/maximgekk/proj/parquet-rebase-save-tz/sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v3_2_0.snappy.parquet
   creator:     parquet-mr version 1.12.1 (build 2a5c06c58fa987f85aa22170be14d927d5ff6e7d)
   extra:       org.apache.spark.version = 3.2.0
   extra:       org.apache.spark.legacyINT96 =
   extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"dict","type":"timestamp","nullable":true,"metadata":{}},{"name":"plain","type":"timestamp","nullable":true,"metadata":{}}]}
   extra:       org.apache.spark.legacyDateTime =
   
   file schema: spark_schema
   --------------------------------------------------------------------------------
   dict:        OPTIONAL INT64 L:TIMESTAMP(MICROS,true) R:0 D:1
   plain:       OPTIONAL INT64 L:TIMESTAMP(MICROS,true) R:0 D:1
   ```
   By running the test `"SPARK-31159, SPARK-37705: compatibility with Spark 2.4/3.2 in reading dates/timestamps"`, checked loading of mixed parquet files generated by Spark 2.4.5/2.4.6 and 3.2.0/master.
   
   5. Generated avro files with Spark 3.2.0 (commit https://github.com/apache/spark/commit/5d45a415f3a29898d92380380cfd82bfc7f579ea) using the test `"SPARK-31855: generate test files for checking compatibility with Spark 2.4"`. The avro files don't contain info about the original time zone:
   ```
   $ java -jar ~/Downloads/avro-tools-1.9.2.jar getmeta external/avro/src/test/resources/before_1582_timestamp_micros_v3_2_0.avro
   avro.schema	{"type":"record","name":"topLevelRecord","fields":[{"name":"dt","type":[{"type":"long","logicalType":"timestamp-micros"},"null"]}]}
   org.apache.spark.version	3.2.0
   avro.codec	snappy
   org.apache.spark.legacyDateTime
   ```
   By running the test `"SPARK-31159, SPARK-37705: compatibility with Spark 2.4/3.2 in reading dates/timestamps"`, checked loading of mixed avro files generated by Spark 2.4.5/2.4.6 and 3.2.0/master.
   
   Authored-by: Max Gekk <ma...@gmail.com>
   Signed-off-by: Wenchen Fan <we...@databricks.com>
   (cherry picked from commit ef3a47038606ea426c15844b0400f5141acd5108)
   Signed-off-by: Max Gekk <ma...@gmail.com>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #35042: [SPARK-37705][SQL][3.2] Rebase timestamps in the session time zone saved in Parquet/Avro metadata

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #35042:
URL: https://github.com/apache/spark/pull/35042


   




[GitHub] [spark] HyukjinKwon commented on pull request #35042: [SPARK-37705][SQL][3.2] Rebase timestamps in the session time zone saved in Parquet/Avro metadata

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #35042:
URL: https://github.com/apache/spark/pull/35042#issuecomment-1002343731


   Merged to branch-3.2.

