Posted to github@arrow.apache.org by "sundy-li (via GitHub)" <gi...@apache.org> on 2023/06/01 02:50:17 UTC

[GitHub] [arrow-rs] sundy-li opened a new issue, #4328: Add util function to convert from `ParquetStatistics` to `ArrayRef`

sundy-li opened a new issue, #4328:
URL: https://github.com/apache/arrow-rs/issues/4328

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   **Describe the solution you'd like**
   
   Add util function to convert from `ParquetStatistics` to `ArrayRef`
   
   **Describe alternatives you've considered**
   
   arrow-datafusion has a util trait `PruningStatistics` that converts `RowGroupPruningStatistics` into `ArrayRef` used to prune the blocks.
   
   https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/physical_plan/file_format/parquet/row_groups.rs#L229
   
   
   But a util function like `get_min_max_values` converts the `statistics` into DataFusion's `ScalarValue` and then converts it back into an `ArrayRef`, which seems redundant because the conversion could be done without DataFusion.
   
   So I suggest that arrow-rs could support this trait like [arrow2 did](https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/statistics/mod.rs#LL445C1-L445C1)
   
   
   
   **Additional context**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Add function that converts from parquet statistics `ParquetStatistics` to arrow arrays `ArrayRef` [arrow-rs]

Posted by "opensourcegeek (via GitHub)" <gi...@apache.org>.
opensourcegeek commented on issue #4328:
URL: https://github.com/apache/arrow-rs/issues/4328#issuecomment-2099208518

   Just wondering if there's anything left to do to address this issue please? If so, I'm happy to pick this up if that's ok.




[GitHub] [arrow-rs] alamb commented on issue #4328: Add util function to convert from `ParquetStatistics` to `ArrayRef`

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4328:
URL: https://github.com/apache/arrow-rs/issues/4328#issuecomment-1587286441

   @tustvold  notes that the translation from parquet data model to arrow data model is quite tricky -- and he may have time to do this in a week or two




[GitHub] [arrow-rs] alamb commented on issue #4328: Add util function to convert from `ParquetStatistics` to `ArrayRef`

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4328:
URL: https://github.com/apache/arrow-rs/issues/4328#issuecomment-1587274842

   This would also help us in IOx (see https://github.com/influxdata/influxdb_iox/issues/7470, per @crepererum )




[GitHub] [arrow-rs] alamb commented on issue #4328: Add util function to convert from `ParquetStatistics` to `ArrayRef`

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4328:
URL: https://github.com/apache/arrow-rs/issues/4328#issuecomment-1587283067

   It would also be awesome to get a similar treatment for data pages (the so-called "page index" values)
   
   Maybe we could add something to the `parquet` crate behind the `arrow` feature: https://docs.rs/parquet/41.0.0/parquet/arrow/index.html
   
   That could read the parquet statistics as an ArrayRef




Re: [I] Add function that converts from parquet statistics `ParquetStatistics` to arrow arrays `ArrayRef` [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4328:
URL: https://github.com/apache/arrow-rs/issues/4328#issuecomment-2099247223

   > Just wondering if there's anything left to do to address this issue please? If so, I'm happy to pick this up if that's ok.
   
   That would be amazing -- thank you very much @opensourcegeek 
   
   What I think would be ideal is an API in `parquet::arrow` that looks like this:
   
   ```rust
   use arrow_array::ArrayRef;
   use arrow_schema::DataType;
   use parquet::errors::Result;
   use parquet::file::statistics::Statistics;

   /// Statistics extracted from `Statistics` as Arrow `ArrayRef`s
   ///
   /// # Note:
   /// If the corresponding `Statistics` is not present, or has no information for
   /// a column, a NULL is present in the corresponding array entry
   pub struct ArrowStatistics {
       /// min values
       min: ArrayRef,
       /// max values
       max: ArrayRef,
       /// Row counts (UInt64Array)
       row_count: ArrayRef,
       /// Null Counts (UInt64Array)
       null_count: ArrayRef,
   }
   
   // (TODO accessors for min/max/row_count/null_count)
   
   /// Extract `ArrowStatistics` from the parquet [`Statistics`]
   pub fn parquet_stats_to_arrow<'a>(
       arrow_datatype: &DataType,
       // an explicit lifetime is needed for a reference inside `impl Trait`
       statistics: impl IntoIterator<Item = Option<&'a Statistics>>,
   ) -> Result<ArrowStatistics> {
       todo!()
   }
   ```
   
   (This is similar to the existing API [parquet](https://docs.rs/parquet/latest/parquet/index.html)::[arrow](https://docs.rs/parquet/latest/parquet/arrow/index.html)::[parquet_to_arrow_schema](https://docs.rs/parquet/latest/parquet/arrow/fn.parquet_to_arrow_schema.html#))
   
   Note it is this [`Statistics`](https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html) 
   
   There is a version of this code in DataFusion that could perhaps be adapted: https://github.com/apache/datafusion/blob/accce9732e26723cab2ffc521edbf5a3fe7460b3/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L179-L186
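   
   To make the shape of the conversion concrete, here is a minimal std-only sketch of the idea: take one optional statistics entry per row group and produce one "column" per statistic, with a missing value standing in for an Arrow NULL. Everything in it is an assumption for illustration: `Int32ChunkStats`, `ColumnarStats` and `stats_to_columns` are hypothetical names, `Vec<Option<_>>` stands in for an `ArrayRef`, only Int32 is modelled, and `row_count` is left out because it is not part of the per-column statistics being modelled here. It is not the parquet or arrow API.
   
   ```rust
   /// Hypothetical stand-in for parquet's per-column-chunk `Statistics`,
   /// restricted to Int32 values for brevity.
   #[derive(Debug, Clone, PartialEq)]
   pub struct Int32ChunkStats {
       pub min: Option<i32>,
       pub max: Option<i32>,
       pub null_count: Option<u64>,
   }
   
   /// Hypothetical columnar result: one entry per row group.
   /// `None` plays the role of a NULL slot in an Arrow array.
   #[derive(Debug, Default, PartialEq)]
   pub struct ColumnarStats {
       pub min: Vec<Option<i32>>,
       pub max: Vec<Option<i32>>,
       pub null_count: Vec<Option<u64>>,
   }
   
   /// Transpose per-row-group statistics into per-statistic columns.
   /// A row group without statistics (`None`) yields a NULL in every column;
   /// a statistic missing inside an entry yields a NULL in that column only.
   pub fn stats_to_columns<'a>(
       statistics: impl IntoIterator<Item = Option<&'a Int32ChunkStats>>,
   ) -> ColumnarStats {
       let mut out = ColumnarStats::default();
       for stats in statistics {
           out.min.push(stats.and_then(|s| s.min));
           out.max.push(stats.and_then(|s| s.max));
           out.null_count.push(stats.and_then(|s| s.null_count));
       }
       out
   }
   ```
   
   Calling `stats_to_columns(vec![Some(&a), None, Some(&b)])` gives three entries per column, with the middle entry NULL in all of them. A real implementation would additionally need the Arrow `DataType` to decide how each physical parquet value maps to an Arrow value, which is the tricky part mentioned earlier in this thread.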
   
   ## Testing
   I suggest you add a new top level test binary in https://github.com/apache/arrow-rs/tree/master/parquet/tests called `statistics.rs`
   
   The tests should look like:
   ```
   let record_batch = make_batch_with_relevant_datatype();
   // write batch/batches to file
   // open file / extract stats from metadata
   // compare stats
   ```
   
   I can help write these tests
   
   I personally suggest:
   1. Make a PR with the basic API and a few basic types (like Int/UInt and maybe String) and figure out the test pattern (I can definitely help here)
   2. Then we can fill out support for the rest of the types in a follow on PR
   
   cc @tustvold  in case you have other ideas




Re: [I] Add util function to convert from `ParquetStatistics` to `ArrayRef` [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4328:
URL: https://github.com/apache/arrow-rs/issues/4328#issuecomment-2042405664

   @sundy-li  I wonder if there is code that handles this feature? I didn't see any obvious PRs
   
   https://github.com/apache/arrow-rs/pulls?q=is%3Apr+statistics+is%3Aclosed
   
   (I still harbor goals of getting this feature into arrow-rs)




Re: [I] Add function that converts from parquet statistics `ParquetStatistics` to arrow arrays `ArrayRef` [arrow-rs]

Posted by "opensourcegeek (via GitHub)" <gi...@apache.org>.
opensourcegeek commented on issue #4328:
URL: https://github.com/apache/arrow-rs/issues/4328#issuecomment-2099369705

   Thanks @alamb - I'll take a stab tomorrow and see how I get on. I'm new to Parquet so please bear with me.




Re: [I] Add util function to convert from `ParquetStatistics` to `ArrayRef` [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4328:
URL: https://github.com/apache/arrow-rs/issues/4328#issuecomment-1830135242

   Note we have started refactoring the code in DataFusion into a format that could reasonably be ported upstream (see https://github.com/apache/arrow-datafusion/pull/8294)
   
   




Re: [I] Add util function to convert from `ParquetStatistics` to `ArrayRef` [arrow-rs]

Posted by "sundy-li (via GitHub)" <gi...@apache.org>.
sundy-li closed issue #4328: Add util function to convert from `ParquetStatistics` to `ArrayRef`
URL: https://github.com/apache/arrow-rs/issues/4328

