Posted to github@arrow.apache.org by "comphead (via GitHub)" <gi...@apache.org> on 2023/11/02 21:22:38 UTC
Re: [I] Wrong timestamp type read while from parquet file created by spark [arrow-datafusion]
comphead commented on issue #7958:
URL: https://github.com/apache/arrow-datafusion/issues/7958#issuecomment-1791556116
`arrow-rs` treats the Parquet INT96 type as `Timestamp(Nanosecond)`:
https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/schema/primitive.rs#L97
Snowflake has an interesting explanation of the same issue:
https://community.snowflake.com/s/article/TIMESTAMP-function-returns-wrong-date-time-value-from-Parquet-file
Key takeaways
- The INT96 Parquet type is deprecated: https://issues.apache.org/jira/browse/PARQUET-323
- INT96 is only used to represent **nanosecond** timestamps
- Apache projects like Hive and Spark still incorrectly treat the first 16 bytes, so they return what users think is the correct value, but it is in fact incorrect.
That is the reason for the difference. However, DuckDB also behaves the same way as Spark. To provide compatibility, we may want to introduce a config parameter in DF that treats INT96 the way Spark does.
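
To make the nanosecond interpretation concrete, here is a minimal Python sketch of decoding an INT96 timestamp. It assumes the commonly used Impala/Spark layout (8 little-endian bytes of nanoseconds within the day, followed by a 4-byte little-endian Julian day). The function names are hypothetical and are not part of `arrow-rs` or DataFusion.

```python
import struct
from datetime import date

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01
NANOS_PER_DAY = 86_400 * 1_000_000_000
I64_MIN, I64_MAX = -(2**63), 2**63 - 1


def decode_int96(raw: bytes) -> tuple:
    """Split a 12-byte INT96 value into (julian_day, nanos_of_day).

    Assumed layout: nanoseconds within the day (8 bytes), then Julian day (4 bytes),
    both little-endian.
    """
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return julian_day, nanos_of_day


def int96_to_epoch_nanos(raw: bytes) -> int:
    """Convert an INT96 value to nanoseconds since the Unix epoch (unbounded int)."""
    julian_day, nanos_of_day = decode_int96(raw)
    return (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day


def fits_in_i64(value: int) -> bool:
    """Check whether a value can be stored in a signed 64-bit timestamp column."""
    return I64_MIN <= value <= I64_MAX


# Hypothetical value: midnight on 9999-12-31, encoded with the assumed layout.
days = (date(9999, 12, 31) - date(1970, 1, 1)).days
raw = struct.pack("<qi", 0, JULIAN_DAY_OF_UNIX_EPOCH + days)

nanos = int96_to_epoch_nanos(raw)
print(fits_in_i64(nanos))          # nanosecond precision
print(fits_in_i64(nanos // 1000))  # microsecond precision
```

Under these assumptions, a far-future date does not fit in a 64-bit nanosecond timestamp but does fit at microsecond precision. That is one way a reader using nanoseconds and a reader using a coarser unit can disagree on the same file.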
What are your thoughts? @alamb @waitingkuo @tustvold @viirya