Posted to github@arrow.apache.org by "rtyler (via GitHub)" <gi...@apache.org> on 2023/04/13 04:33:45 UTC

[GitHub] [arrow-rs] rtyler opened a new issue, #4075: Parquet reader of Int96 columns and coercion to timestamps

rtyler opened a new issue, #4075:
URL: https://github.com/apache/arrow-rs/issues/4075

   **Which part is this question about**
   
   I am using the parquet crate through delta-rs and trying to understand the disconnect between Delta's interpretation of `timestamp` and Parquet's. For example, [Delta considers timestamps as microseconds since epoch](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#primitive-types).
   
   
   **Describe your question**
   
   The parquet format docs have a [dedicated timestamp type](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#primitive-types) which I don't believe Delta is using. The parquet files written by [Delta](https://github.com/delta-io/delta) (the Spark implementation) write out an int96 type.
   
   The `parquet-tools` CLI shows the column type from a `.parquet` file as:
   
   ```
   ############ Column(timestamp) ############
   name: timestamp
   path: timestamp
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: INT96
   logical_type: None
   converted_type (legacy): NONE
   compression: SNAPPY (space_saved: 13%)
   ```
   
   When I modify the `read_parquet.rs` example, the schema of `RecordBatch` coming from an example file with the above column is:
   
   ```
   Field { name: "timestamp", data_type: Timestamp(Nanosecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
   ```
   
   I am assuming that the code which converts the INT96 column to a timestamp is in `consume_batch` within `primitive_array.rs`, but I'm not entirely sure.
   
   
   I'm hoping for some help figuring out where the disconnect might be between how Delta Lake thinks "timestamp" should look (microseconds) versus the Parquet Rust reader which coerces that INT96 to nanoseconds.
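   
   For reference, here is a minimal sketch of why nanoseconds fall out of an INT96 value. It assumes the conventional legacy layout: 8 little-endian bytes of nanoseconds within the day, followed by a 4-byte little-endian Julian day number. The helper names are hypothetical, the Julian day is read as a signed 32-bit value (an assumption that does not matter for realistic dates), and this is not the arrow-rs implementation.
   
   ```rust
   /// Julian day number of 1970-01-01, the Unix epoch.
   const JULIAN_DAY_OF_UNIX_EPOCH: i64 = 2_440_588;
   const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;
   
   /// Hypothetical helper: pack (nanoseconds within the day, Julian day)
   /// into the assumed 12-byte little-endian INT96 layout.
   fn int96_from_parts(nanos_of_day: i64, julian_day: i32) -> [u8; 12] {
       let mut raw = [0u8; 12];
       raw[..8].copy_from_slice(&nanos_of_day.to_le_bytes());
       raw[8..].copy_from_slice(&julian_day.to_le_bytes());
       raw
   }
   
   /// Hypothetical helper: decode INT96 bytes into nanoseconds since the
   /// Unix epoch. Returns None if the result does not fit in an i64.
   fn int96_to_epoch_nanos(raw: [u8; 12]) -> Option<i64> {
       let mut nanos = [0u8; 8];
       nanos.copy_from_slice(&raw[..8]);
       let mut day = [0u8; 4];
       day.copy_from_slice(&raw[8..]);
       let nanos_of_day = i64::from_le_bytes(nanos);
       let julian_day = i64::from(i32::from_le_bytes(day));
       (julian_day - JULIAN_DAY_OF_UNIX_EPOCH)
           .checked_mul(NANOS_PER_DAY)?
           .checked_add(nanos_of_day)
   }
   ```
   
   Calling `int96_to_epoch_nanos(int96_from_parts(0, 2_440_588))` yields `Some(0)`, the Unix epoch itself. Under this layout the sub-day component is already in nanoseconds, so a reader that keeps full precision has no coarser unit to hand back.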
   
   I'm trying to figure out 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] rtyler closed issue #4075: Parquet reader of Int96 columns and coercion to timestamps

Posted by "rtyler (via GitHub)" <gi...@apache.org>.

rtyler closed issue #4075: Parquet reader of Int96 columns and coercion to timestamps
URL: https://github.com/apache/arrow-rs/issues/4075


[GitHub] [arrow-rs] tustvold commented on issue #4075: Parquet reader of Int96 columns and coercion to timestamps

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #4075:
URL: https://github.com/apache/arrow-rs/issues/4075#issuecomment-1506626506

   https://github.com/apache/arrow-datafusion/issues/5950 may be related here, FYI @wjones127 


[GitHub] [arrow-rs] rtyler commented on issue #4075: Parquet reader of Int96 columns and coercion to timestamps

Posted by "rtyler (via GitHub)" <gi...@apache.org>.

rtyler commented on issue #4075:
URL: https://github.com/apache/arrow-rs/issues/4075#issuecomment-1507480870

   [This link to Apache Spark](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L970-L980) code was shared with me, and it makes me so sad.
   
   Thanks for the input @tustvold 


[GitHub] [arrow-rs] tustvold commented on issue #4075: Parquet reader of Int96 columns and coercion to timestamps

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #4075:
URL: https://github.com/apache/arrow-rs/issues/4075#issuecomment-1506625004

   The parquet reader is returning nanoseconds because that is the precision present in the encoding. I'm not familiar with deltalake's timestamp handling, but it may be that they assume all timestamps are microseconds. As this is not actually true, delta-rs should probably be adding coercion logic to convert where appropriate.
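   
   A rough sketch of what such coercion could look like on plain integers, assuming the column has already been read as optional nanoseconds since the Unix epoch. The function names are hypothetical and this is not the delta-rs or arrow-rs API; it only illustrates the unit conversion, rounding toward negative infinity so that pre-1970 values are handled the same way as later ones.
   
   ```rust
   /// Hypothetical helper: convert nanoseconds since the Unix epoch to
   /// microseconds, flooring so that negative (pre-1970) values round
   /// toward negative infinity rather than toward zero.
   fn nanos_to_micros(nanos: i64) -> i64 {
       nanos.div_euclid(1_000)
   }
   
   /// Hypothetical helper: coerce a nullable column of nanosecond
   /// timestamps to microseconds, preserving nulls.
   fn coerce_column_to_micros(nanos: &[Option<i64>]) -> Vec<Option<i64>> {
       nanos.iter().map(|v| v.map(nanos_to_micros)).collect()
   }
   ```
   
   For Arrow arrays, the `cast` kernel in `arrow::compute` is probably the more direct route for changing the timestamp unit; the sketch above sticks to plain integers so the arithmetic is visible. Note that the sub-microsecond digits are discarded either way.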


[GitHub] [arrow-rs] rtyler commented on issue #4075: Parquet reader of Int96 columns and coercion to timestamps

Posted by "rtyler (via GitHub)" <gi...@apache.org>.

rtyler commented on issue #4075:
URL: https://github.com/apache/arrow-rs/issues/4075#issuecomment-1507107061

   > FWIW the Int96 encoding has been deprecated for almost a decade, it is slightly ridiculous that Spark still is using it
   
   Well that makes me sad :laughing: but I'm not surprised. 

