Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2021/10/09 05:11:00 UTC

[jira] [Commented] (SPARK-36958) Reading of legacy timestamps from Parquet confusing in Spark 3, related config values don't seem to work

    [ https://issues.apache.org/jira/browse/SPARK-36958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426495#comment-17426495 ] 

Hyukjin Kwon commented on SPARK-36958:
--------------------------------------

[~dgoldenberg] Doesn't it work if you set it to LEGACY? It would also be great if you could post a self-contained reproducer.
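For example, a minimal sketch of what that would look like (the app name and input path below are placeholders, not taken from this report):

{code:python}
from pyspark.sql import SparkSession

# Rebase INT96 timestamps written by Spark 2.x / legacy Hive (hybrid
# Julian+Gregorian calendar) onto Spark 3's Proleptic Gregorian calendar.
spark = (
    SparkSession.builder
    .appName("legacy-int96-read")  # placeholder app name
    .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")
    .getOrCreate()
)

df = spark.read.parquet("/path/to/spark24-parquet")  # placeholder path
df.count()  # the action that raised SparkUpgradeException in the report
{code}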

> Reading of legacy timestamps from Parquet confusing in Spark 3, related config values don't seem to work
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36958
>                 URL: https://issues.apache.org/jira/browse/SPARK-36958
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.2
>         Environment: emr-6.4.0
> spark 3.1.2
>            Reporter: Dmitry Goldenberg
>            Priority: Major
>
> I'm having a major issue trying to read, in Spark 3, Parquet data that was generated with Spark 2.4.
> The full stack trace is below.
> The error message is very confusing:
>  # I do not have dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z.
>  # The documentation does not state clearly how to work around/fix this issue. What exactly is the difference between the LEGACY and CORRECTED values of the config settings?
>  # Which of the following would I want to set, and to what values?
> - spark.sql.legacy.parquet.datetimeRebaseModeInWrite
> - spark.sql.legacy.parquet.datetimeRebaseModeInRead
> - spark.sql.legacy.parquet.int96RebaseModeInRead
> - spark.sql.legacy.parquet.int96RebaseModeInWrite
> - spark.sql.legacy.timeParserPolicy
>  # I've tried setting these to CORRECTED, CORRECTED, CORRECTED, CORRECTED, and LEGACY, respectively (as sketched just below), and got the same error (see the stack trace).
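> For illustration, applying these settings at session creation looks roughly like this (a simplified sketch, not the actual job code):
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = (
>     SparkSession.builder
>     .config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
>     .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
>     .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
>     .config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
>     .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
>     .getOrCreate()
> )
> {code}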
> The issues that I see with this:
>  # Lack of thorough, clear documentation on what this is and how it's meant to work.
>  # The confusing error message.
>  # The fact that the error still occurs even when you set the config values.
>  
> {quote}
> py4j.protocol.Py4JJavaError: An error occurred while calling o1134.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 36.0 failed 4 times, most recent failure: Lost task 8.3 in stage 36.0 (TID 619) (ip-10-2-251-59.awsinternal.audiomack.com executor 2): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the datetime values as it is.
>     at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
>     at org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.rebaseTimestamp(VectorizedColumnReader.java:228)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.rebaseInt96(VectorizedColumnReader.java:242)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:662)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:300)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:295)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:193)
>     at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
>     at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:832)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179)
>     at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {quote}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org