Posted to github@arrow.apache.org by "comphead (via GitHub)" <gi...@apache.org> on 2023/11/02 21:22:38 UTC

Re: [I] Wrong timestamp type read while from parquet file created by spark [arrow-datafusion]

comphead commented on issue #7958:
URL: https://github.com/apache/arrow-datafusion/issues/7958#issuecomment-1791556116

   `arrow-rs` treats the Parquet INT96 type as `Timestamp(Nanosecond)`:
   https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/schema/primitive.rs#L97 
   
   Snowflake has an interesting explanation of the same issue:
   https://community.snowflake.com/s/article/TIMESTAMP-function-returns-wrong-date-time-value-from-Parquet-file
   
   Key takeaways
   - The Parquet INT96 type is deprecated: https://issues.apache.org/jira/browse/PARQUET-323
   - INT96 is only used to represent **nanosecond** timestamps
   - Apache projects such as Hive and Spark still treat the first 16 bytes incorrectly, so they return what users think is the correct value when it is in fact incorrect.
   
   That is the reason for the difference. However, DuckDB also behaves like Spark. To provide compatibility, we may want to introduce a config parameter in DF that treats INT96 the way Spark does.
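
   To make the difference concrete, here is a rough sketch of how the target time unit can change the result. It assumes the commonly described INT96 layout (8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes of Julian day number) and assumes the mismatch comes from the limited range of an `i64` nanosecond timestamp, which this thread does not confirm. All function names are hypothetical and are not `arrow-rs` or DataFusion APIs.

```rust
/// Julian day number of the Unix epoch (1970-01-01).
const JULIAN_DAY_OF_UNIX_EPOCH: i64 = 2_440_588;
const NANOS_PER_DAY: i64 = 86_400_000_000_000;
const MICROS_PER_DAY: i64 = 86_400_000_000;

/// Hypothetical helper: build a 12-byte INT96 value.
/// Assumed layout: bytes 0..8 = nanoseconds within the day (little-endian),
/// bytes 8..12 = Julian day number (little-endian).
fn encode_int96(julian_day: i32, nanos_of_day: i64) -> [u8; 12] {
    let mut raw = [0u8; 12];
    raw[0..8].copy_from_slice(&nanos_of_day.to_le_bytes());
    raw[8..12].copy_from_slice(&julian_day.to_le_bytes());
    raw
}

/// Hypothetical helper: split an INT96 value into (julian_day, nanos_of_day).
fn split_int96(raw: [u8; 12]) -> (i64, i64) {
    let mut nanos = [0u8; 8];
    nanos.copy_from_slice(&raw[0..8]);
    let mut day = [0u8; 4];
    day.copy_from_slice(&raw[8..12]);
    (i32::from_le_bytes(day) as i64, i64::from_le_bytes(nanos))
}

/// Nanoseconds since the Unix epoch, or `None` if the value does not fit
/// in an i64 (roughly outside the years 1677 to 2262).
fn int96_to_nanos(raw: [u8; 12]) -> Option<i64> {
    let (day, nanos) = split_int96(raw);
    (day - JULIAN_DAY_OF_UNIX_EPOCH)
        .checked_mul(NANOS_PER_DAY)?
        .checked_add(nanos)
}

/// Microseconds since the Unix epoch (sub-microsecond digits are truncated),
/// or `None` on overflow. The range is far wider than for nanoseconds.
fn int96_to_micros(raw: [u8; 12]) -> Option<i64> {
    let (day, nanos) = split_int96(raw);
    (day - JULIAN_DAY_OF_UNIX_EPOCH)
        .checked_mul(MICROS_PER_DAY)?
        .checked_add(nanos / 1_000)
}
```

   With these helpers, a value for 9999-12-31 (Julian day 5373484) converts to microseconds without trouble, while the nanosecond conversion returns `None` because the result does not fit in an `i64`. A config parameter like the one suggested above could, under this assumption, choose which of the two conversions is applied to INT96 columns.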
   
   What are your thoughts? @alamb @waitingkuo @tustvold @viirya 
   
   

