You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/25 13:27:33 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

alamb opened a new pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426


   Closes #363
   
   **Note**: This is an alternate approach to https://github.com/apache/arrow-datafusion/pull/380, providing statistics using `Array`s rather than `ScalarValue`s. Only one of this or #380  should be merged
   
   # Rationale
   As explained on #363 the high level goal is to make the parquet row group pruning logic generic to any types of min/max statistics (not just parquet metadata)
   
   # Changes:
   1. Introduce a new `PruningStatistics` trait
   2. Refactor `PruningPredicateBuilder` to be generic in terms of `PruningStatistics`
   3. Add documentation and tests
   
   # Example
   
   Here is a brief snippet of one of the tests that shows the new API:
   ```rust
           // Prune using s2 > 5
           let expr = col("s2").gt(lit(5));
   
           // s2 [0, 5] ==> no rows should pass
           let stats1 = TestStatistics::new()
               .with("s1", MinMax::new(None, None))
               .with("s2", MinMax::new(Some(0i32.into()), Some(5i32.into())));
   
           // s2 [4, 6] ==> some rows could pass
           let stats2 = TestStatistics::new()
               .with("s1", MinMax::new(None, None))
               .with("s2", MinMax::new(Some(4i32.into()), Some(6i32.into())));
   
           let p = PruningPredicate::try_new(&expr, schema).unwrap();
           let result = p.prune(&[stats1, stats2]).unwrap();
   
           // false means no rows could possibly match (can prune)
           // true means some rows might match (can not prune)
           let expected = vec![false, true];
   
           assert_eq!(expected, result);
   ```
   
   # Notes:
   I am leaving this PR in draft state in case anyone is interested, as I
   1.  merge its dependencies(see below) in
   2. work on a POC demonstrating the use in IOx
   
   # Sequence:
   
   I am trying to do this in a few small PRs to reduce review burden; Here is how connect together:
   
   Planned changes:
   - [x] Refactor code into a new module (https://github.com/apache/arrow-datafusion/pull/365)
   - [x] Return bool rather than parquet specific output (https://github.com/apache/arrow-datafusion/pull/370)
   - [x] Add `ScalarValue::iter_to_array` (https://github.com/apache/arrow-datafusion/pull/381)
   - [ ] Add `PruningStatstics` Trait (this PR)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426#issuecomment-847982815


   FYI @NGA-TRAN 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] NGA-TRAN commented on a change in pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

Posted by GitBox <gi...@apache.org>.
NGA-TRAN commented on a change in pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426#discussion_r638935367



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -52,26 +44,81 @@ use crate::{
     physical_plan::{planner::DefaultPhysicalPlanner, ColumnarValue, PhysicalExpr},
 };
 
+/// Interface to pass statistics information to [`PruningPredicates`]
+///
+/// Returns statistics for containers / files of data in Arrays.
+///
+/// For example, for the following three files with a single column
+/// ```text
+/// file1: column a: min=5, max=10
+/// file2: column a: No stats
+/// file2: column a: min=20, max=30
+/// ```
+///
+/// PruningStatistics should return:
+///
+/// ```text
+/// min_values("a") -> Some([5, Null, 20])
+/// max_values("a") -> Some([20, Null, 30])
+/// min_values("X") -> None

Review comment:
       What is line for?

##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -52,26 +44,81 @@ use crate::{
     physical_plan::{planner::DefaultPhysicalPlanner, ColumnarValue, PhysicalExpr},
 };
 
+/// Interface to pass statistics information to [`PruningPredicates`]
+///
+/// Returns statistics for containers / files of data in Arrays.
+///
+/// For example, for the following three files with a single column
+/// ```text
+/// file1: column a: min=5, max=10
+/// file2: column a: No stats
+/// file2: column a: min=20, max=30
+/// ```
+///
+/// PruningStatistics should return:
+///
+/// ```text
+/// min_values("a") -> Some([5, Null, 20])
+/// max_values("a") -> Some([20, Null, 30])

Review comment:
       ```suggestion
   /// max_values("a") -> Some([10, Null, 30])
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426#discussion_r638790029



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -52,26 +44,81 @@ use crate::{
     physical_plan::{planner::DefaultPhysicalPlanner, ColumnarValue, PhysicalExpr},
 };
 
+/// Interface to pass statistics information to [`PruningPredicates`]

Review comment:
       Here is the alternate proposed API based off of @Dandandan  and @jorgecarleitao 's comments here: https://github.com/apache/arrow-datafusion/pull/380#discussion_r638143765




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426#issuecomment-849131065


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/426?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#426](https://codecov.io/gh/apache/arrow-datafusion/pull/426?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1e88763) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/ea59d05b6390a0f676956db9160805b3f660cb54?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (ea59d05) will **increase** coverage by `0.52%`.
   > The diff coverage is `83.62%`.
   
   > :exclamation: Current head 1e88763 differs from pull request most recent head 1df3f32. Consider uploading reports for the commit 1df3f32 to get more accurate results
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/426/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/426?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #426      +/-   ##
   ==========================================
   + Coverage   74.84%   75.37%   +0.52%     
   ==========================================
     Files         146      147       +1     
     Lines       24515    24810     +295     
   ==========================================
   + Hits        18349    18700     +351     
   + Misses       6166     6110      -56     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/426?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/src/physical\_plan/expressions/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/426/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9tb2QucnM=) | `71.42% <ø> (ø)` | |
   | [datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/426/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2FnZ3JlZ2F0ZS5ycw==) | `85.21% <ø> (ø)` | |
   | [datafusion/src/physical\_plan/sort.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/426/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9zb3J0LnJz) | `91.26% <56.25%> (-0.82%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/window\_functions.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/426/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi93aW5kb3dfZnVuY3Rpb25zLnJz) | `85.71% <57.89%> (-3.01%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/426/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9tb2QucnM=) | `78.70% <61.90%> (-4.06%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/windows.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/426/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi93aW5kb3dzLnJz) | `73.78% <76.57%> (+73.78%)` | :arrow_up: |
   | [...fusion/src/physical\_plan/expressions/row\_number.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/426/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9yb3dfbnVtYmVyLnJz) | `81.25% <81.25%> (ø)` | |
   | [datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/426/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9wYXJxdWV0LnJz) | `82.19% <91.66%> (+0.42%)` | :arrow_up: |
   | [datafusion/src/optimizer/constant\_folding.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/426/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvb3B0aW1pemVyL2NvbnN0YW50X2ZvbGRpbmcucnM=) | `91.69% <92.10%> (+0.05%)` | :arrow_up: |
   | [datafusion/src/physical\_optimizer/pruning.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/426/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfb3B0aW1pemVyL3BydW5pbmcucnM=) | `90.08% <92.64%> (+0.34%)` | :arrow_up: |
   | ... and [19 more](https://codecov.io/gh/apache/arrow-datafusion/pull/426/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/426?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/426?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [ea59d05...1df3f32](https://codecov.io/gh/apache/arrow-datafusion/pull/426?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426#discussion_r638992159



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -52,26 +44,81 @@ use crate::{
     physical_plan::{planner::DefaultPhysicalPlanner, ColumnarValue, PhysicalExpr},
 };
 
+/// Interface to pass statistics information to [`PruningPredicates`]
+///
+/// Returns statistics for containers / files of data in Arrays.
+///
+/// For example, for the following three files with a single column
+/// ```text
+/// file1: column a: min=5, max=10
+/// file2: column a: No stats
+/// file2: column a: min=20, max=30
+/// ```
+///
+/// PruningStatistics should return:
+///
+/// ```text
+/// min_values("a") -> Some([5, Null, 20])
+/// max_values("a") -> Some([20, Null, 30])
+/// min_values("X") -> None

Review comment:
       I was trying to show that unknown columns are not an error, but should return None
   
   Maybe this would be clearer?
   ```suggestion
   /// min_values("some_other_column") -> None
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426#discussion_r638960571



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -457,11 +461,102 @@ fn send_result(
     Ok(())
 }
 
+/// Wraps parquet statistics in a way
+/// that implements [`PruningStatistics`]
+struct RowGroupPruningStatistics<'a> {
+    row_group_metadata: &'a [RowGroupMetaData],
+    parquet_schema: &'a Schema,
+}
+
+/// Extract the min/max statistics from a `ParquetStatistics` object
+macro_rules! get_statistic {
+    ($column_statistics:expr, $func:ident, $bytes_func:ident) => {{
+        if !$column_statistics.has_min_max_set() {
+            return None;
+        }
+        match $column_statistics {
+            ParquetStatistics::Boolean(s) => Some(ScalarValue::Boolean(Some(*s.$func()))),
+            ParquetStatistics::Int32(s) => Some(ScalarValue::Int32(Some(*s.$func()))),
+            ParquetStatistics::Int64(s) => Some(ScalarValue::Int64(Some(*s.$func()))),
+            // 96 bit ints not supported
+            ParquetStatistics::Int96(_) => None,
+            ParquetStatistics::Float(s) => Some(ScalarValue::Float32(Some(*s.$func()))),
+            ParquetStatistics::Double(s) => Some(ScalarValue::Float64(Some(*s.$func()))),
+            ParquetStatistics::ByteArray(s) => {
+                let s = std::str::from_utf8(s.$bytes_func())
+                    .map(|s| s.to_string())
+                    .ok();
+                Some(ScalarValue::Utf8(s))
+            }
+            // type not supported yet
+            ParquetStatistics::FixedLenByteArray(_) => None,
+        }
+    }};
+}
+
+// Extract the min or max value calling `func` or `bytes_func` on the ParquetStatistics as appropriate
+macro_rules! get_min_max_values {
+    ($self:expr, $column:expr, $func:ident, $bytes_func:ident) => {{
+        let (column_index, field) = if let Some((v, f)) = $self.parquet_schema.column_with_name($column) {
+            (v, f)
+        } else {
+            // Named column was not present
+            return None
+        };
+
+        let data_type = field.data_type();
+        let null_scalar: ScalarValue = if let Ok(v) = data_type.try_into() {
+            v
+        } else {
+            // DataFusion doesn't have support for ScalarValues of the column type
+            return None
+        };
+
+        let scalar_values : Vec<ScalarValue> = $self.row_group_metadata
+            .iter()
+            .flat_map(|meta| {
+                meta.column(column_index).statistics()
+            })
+            .map(|stats| {
+                get_statistic!(stats, $func, $bytes_func)
+            })
+            .map(|maybe_scalar| {
+                // column either did't have statistics at all or didn't have min/max values
+                maybe_scalar.unwrap_or_else(|| null_scalar.clone())
+            })
+            .collect();

Review comment:
       Collecting to `Vec` might not be necessary here, we could maybe provide it to `ScalarValue::iter_to_array` directly? 

##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -141,39 +182,78 @@ impl PruningPredicateBuilder {
             .map(|x| x.unwrap_or(true))
             .collect::<Vec<_>>())
     }
+
+    /// Return a reference to the input schema
+    pub fn schema(&self) -> &SchemaRef {
+        &self.schema
+    }
 }
 
-/// Build a RecordBatch from a list of statistics (currently parquet
-/// [`RowGroupMetadata`] structs), creating arrays, one for each
-/// statistics column, as requested in the stat_column_req parameter.
-fn build_statistics_record_batch(
-    statistics: &[RowGroupMetaData],
-    schema: &Schema,
+/// Build a RecordBatch from a list of statistics, creating arrays,
+/// with one row for each PruningStatistics and columns specified in
+/// in the stat_column_req parameter.
+///
+/// For example, if the requested columns are
+/// ```text
+/// ("s1", Min, Field:s1_min)
+/// ("s2", Max, field:s2_max)
+///```
+///
+/// And the input statistics had
+/// ```text
+/// S1(Min: 5, Max: 10)
+/// S2(Min: 99, Max: 1000)
+/// S3(Min: 1, Max: 2)
+/// ```
+///
+/// Then this function would build a record batch with 2 columns and
+/// one row s1_min and s2_max as follows (s3 is not requested):
+///
+/// ```text
+/// s1_min | s2_max
+/// -------+--------
+///   5    | 1000
+/// ```
+fn build_statistics_record_batch<S: PruningStatistics>(
+    statistics: &S,
     stat_column_req: &[(String, StatisticsType, Field)],
 ) -> Result<RecordBatch> {
     let mut fields = Vec::<Field>::new();
     let mut arrays = Vec::<ArrayRef>::new();
+    // For each needed statistics column:
     for (column_name, statistics_type, stat_field) in stat_column_req {
-        if let Some((column_index, _)) = schema.column_with_name(column_name) {
-            let statistics = statistics
-                .iter()
-                .map(|g| g.column(column_index).statistics())
-                .collect::<Vec<_>>();
-            let array = build_statistics_array(
-                &statistics,
-                *statistics_type,
-                stat_field.data_type(),
-            );
-            fields.push(stat_field.clone());
-            arrays.push(array);
+        let data_type = stat_field.data_type();
+
+        let num_containers = statistics.num_containers();
+
+        let array = match statistics_type {

Review comment:
       So here is the core difference for changing it to array - less work here is needed when the statistics are loaded.
   Trade off there makes sense for me, at least in cases when we can keep the statistics this should be beneficial.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426#discussion_r638999032



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -457,11 +461,102 @@ fn send_result(
     Ok(())
 }
 
+/// Wraps parquet statistics in a way
+/// that implements [`PruningStatistics`]
+struct RowGroupPruningStatistics<'a> {
+    row_group_metadata: &'a [RowGroupMetaData],
+    parquet_schema: &'a Schema,
+}
+
+/// Extract the min/max statistics from a `ParquetStatistics` object
+macro_rules! get_statistic {
+    ($column_statistics:expr, $func:ident, $bytes_func:ident) => {{
+        if !$column_statistics.has_min_max_set() {
+            return None;
+        }
+        match $column_statistics {
+            ParquetStatistics::Boolean(s) => Some(ScalarValue::Boolean(Some(*s.$func()))),
+            ParquetStatistics::Int32(s) => Some(ScalarValue::Int32(Some(*s.$func()))),
+            ParquetStatistics::Int64(s) => Some(ScalarValue::Int64(Some(*s.$func()))),
+            // 96 bit ints not supported
+            ParquetStatistics::Int96(_) => None,
+            ParquetStatistics::Float(s) => Some(ScalarValue::Float32(Some(*s.$func()))),
+            ParquetStatistics::Double(s) => Some(ScalarValue::Float64(Some(*s.$func()))),
+            ParquetStatistics::ByteArray(s) => {
+                let s = std::str::from_utf8(s.$bytes_func())
+                    .map(|s| s.to_string())
+                    .ok();
+                Some(ScalarValue::Utf8(s))
+            }
+            // type not supported yet
+            ParquetStatistics::FixedLenByteArray(_) => None,
+        }
+    }};
+}
+
+// Extract the min or max value calling `func` or `bytes_func` on the ParquetStatistics as appropriate
+macro_rules! get_min_max_values {
+    ($self:expr, $column:expr, $func:ident, $bytes_func:ident) => {{
+        let (column_index, field) = if let Some((v, f)) = $self.parquet_schema.column_with_name($column) {
+            (v, f)
+        } else {
+            // Named column was not present
+            return None
+        };
+
+        let data_type = field.data_type();
+        let null_scalar: ScalarValue = if let Ok(v) = data_type.try_into() {
+            v
+        } else {
+            // DataFusion doesn't have support for ScalarValues of the column type
+            return None
+        };
+
+        let scalar_values : Vec<ScalarValue> = $self.row_group_metadata
+            .iter()
+            .flat_map(|meta| {
+                meta.column(column_index).statistics()
+            })
+            .map(|stats| {
+                get_statistic!(stats, $func, $bytes_func)
+            })
+            .map(|maybe_scalar| {
+                // column either did't have statistics at all or didn't have min/max values
+                maybe_scalar.unwrap_or_else(|| null_scalar.clone())
+            })
+            .collect();

Review comment:
       Ah I might have had the same problem here: https://github.com/apache/arrow-datafusion/pull/339
   
   I changed ScalarValue::iter_to_array to accept `ScalarValue` instead of `&ScalarValue`
   
   https://github.com/apache/arrow-datafusion/pull/339/files#diff-89202db09c6a169c91c1f7ec44915cf3e61e738e59928eedac7bc7d578a4051fR327




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426#discussion_r640375347



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -457,11 +461,102 @@ fn send_result(
     Ok(())
 }
 
+/// Wraps parquet statistics in a way
+/// that implements [`PruningStatistics`]
+struct RowGroupPruningStatistics<'a> {
+    row_group_metadata: &'a [RowGroupMetaData],
+    parquet_schema: &'a Schema,
+}
+
+/// Extract the min/max statistics from a `ParquetStatistics` object
+macro_rules! get_statistic {
+    ($column_statistics:expr, $func:ident, $bytes_func:ident) => {{
+        if !$column_statistics.has_min_max_set() {
+            return None;
+        }
+        match $column_statistics {
+            ParquetStatistics::Boolean(s) => Some(ScalarValue::Boolean(Some(*s.$func()))),
+            ParquetStatistics::Int32(s) => Some(ScalarValue::Int32(Some(*s.$func()))),
+            ParquetStatistics::Int64(s) => Some(ScalarValue::Int64(Some(*s.$func()))),
+            // 96 bit ints not supported
+            ParquetStatistics::Int96(_) => None,
+            ParquetStatistics::Float(s) => Some(ScalarValue::Float32(Some(*s.$func()))),
+            ParquetStatistics::Double(s) => Some(ScalarValue::Float64(Some(*s.$func()))),
+            ParquetStatistics::ByteArray(s) => {
+                let s = std::str::from_utf8(s.$bytes_func())
+                    .map(|s| s.to_string())
+                    .ok();
+                Some(ScalarValue::Utf8(s))
+            }
+            // type not supported yet
+            ParquetStatistics::FixedLenByteArray(_) => None,
+        }
+    }};
+}
+
+// Extract the min or max value calling `func` or `bytes_func` on the ParquetStatistics as appropriate
+macro_rules! get_min_max_values {
+    ($self:expr, $column:expr, $func:ident, $bytes_func:ident) => {{
+        let (column_index, field) = if let Some((v, f)) = $self.parquet_schema.column_with_name($column) {
+            (v, f)
+        } else {
+            // Named column was not present
+            return None
+        };
+
+        let data_type = field.data_type();
+        let null_scalar: ScalarValue = if let Ok(v) = data_type.try_into() {
+            v
+        } else {
+            // DataFusion doesn't have support for ScalarValues of the column type
+            return None
+        };
+
+        let scalar_values : Vec<ScalarValue> = $self.row_group_metadata
+            .iter()
+            .flat_map(|meta| {
+                meta.column(column_index).statistics()
+            })
+            .map(|stats| {
+                get_statistic!(stats, $func, $bytes_func)
+            })
+            .map(|maybe_scalar| {
+                // column either did't have statistics at all or didn't have min/max values
+                maybe_scalar.unwrap_or_else(|| null_scalar.clone())
+            })
+            .collect();

Review comment:
       Should be possible now that the other PR is merged!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426#discussion_r640498821



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -457,11 +461,102 @@ fn send_result(
     Ok(())
 }
 
+/// Wraps parquet statistics in a way
+/// that implements [`PruningStatistics`]
+struct RowGroupPruningStatistics<'a> {
+    row_group_metadata: &'a [RowGroupMetaData],
+    parquet_schema: &'a Schema,
+}
+
+/// Extract the min/max statistics from a `ParquetStatistics` object
+macro_rules! get_statistic {
+    ($column_statistics:expr, $func:ident, $bytes_func:ident) => {{
+        if !$column_statistics.has_min_max_set() {
+            return None;
+        }
+        match $column_statistics {
+            ParquetStatistics::Boolean(s) => Some(ScalarValue::Boolean(Some(*s.$func()))),
+            ParquetStatistics::Int32(s) => Some(ScalarValue::Int32(Some(*s.$func()))),
+            ParquetStatistics::Int64(s) => Some(ScalarValue::Int64(Some(*s.$func()))),
+            // 96 bit ints not supported
+            ParquetStatistics::Int96(_) => None,
+            ParquetStatistics::Float(s) => Some(ScalarValue::Float32(Some(*s.$func()))),
+            ParquetStatistics::Double(s) => Some(ScalarValue::Float64(Some(*s.$func()))),
+            ParquetStatistics::ByteArray(s) => {
+                let s = std::str::from_utf8(s.$bytes_func())
+                    .map(|s| s.to_string())
+                    .ok();
+                Some(ScalarValue::Utf8(s))
+            }
+            // type not supported yet
+            ParquetStatistics::FixedLenByteArray(_) => None,
+        }
+    }};
+}
+
+// Extract the min or max value calling `func` or `bytes_func` on the ParquetStatistics as appropriate
+macro_rules! get_min_max_values {
+    ($self:expr, $column:expr, $func:ident, $bytes_func:ident) => {{
+        let (column_index, field) = if let Some((v, f)) = $self.parquet_schema.column_with_name($column) {
+            (v, f)
+        } else {
+            // Named column was not present
+            return None
+        };
+
+        let data_type = field.data_type();
+        let null_scalar: ScalarValue = if let Ok(v) = data_type.try_into() {
+            v
+        } else {
+            // DataFusion doesn't have support for ScalarValues of the column type
+            return None
+        };
+
+        let scalar_values : Vec<ScalarValue> = $self.row_group_metadata
+            .iter()
+            .flat_map(|meta| {
+                meta.column(column_index).statistics()
+            })
+            .map(|stats| {
+                get_statistic!(stats, $func, $bytes_func)
+            })
+            .map(|maybe_scalar| {
+                // column either did't have statistics at all or didn't have min/max values
+                maybe_scalar.unwrap_or_else(|| null_scalar.clone())
+            })
+            .collect();

Review comment:
       Now, sadly, when I try (in 86f80797041ad08b236ac72a8cb810c0d9bd1c26) to remove the collect I get the following panic :(
   
   ```
   ---- physical_plan::parquet::tests::row_group_predicate_builder_unsupported_type stdout ----
   thread 'physical_plan::parquet::tests::row_group_predicate_builder_unsupported_type' panicked at 'Iterator must be sized', /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-4.1.0/src/array/array_boolean.rs:168:33
   ```
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426#discussion_r638990402



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -457,11 +461,102 @@ fn send_result(
     Ok(())
 }
 
+/// Wraps parquet statistics in a way
+/// that implements [`PruningStatistics`]
+struct RowGroupPruningStatistics<'a> {
+    row_group_metadata: &'a [RowGroupMetaData],
+    parquet_schema: &'a Schema,
+}
+
+/// Extract the min/max statistics from a `ParquetStatistics` object
+macro_rules! get_statistic {
+    ($column_statistics:expr, $func:ident, $bytes_func:ident) => {{
+        if !$column_statistics.has_min_max_set() {
+            return None;
+        }
+        match $column_statistics {
+            ParquetStatistics::Boolean(s) => Some(ScalarValue::Boolean(Some(*s.$func()))),
+            ParquetStatistics::Int32(s) => Some(ScalarValue::Int32(Some(*s.$func()))),
+            ParquetStatistics::Int64(s) => Some(ScalarValue::Int64(Some(*s.$func()))),
+            // 96 bit ints not supported
+            ParquetStatistics::Int96(_) => None,
+            ParquetStatistics::Float(s) => Some(ScalarValue::Float32(Some(*s.$func()))),
+            ParquetStatistics::Double(s) => Some(ScalarValue::Float64(Some(*s.$func()))),
+            ParquetStatistics::ByteArray(s) => {
+                let s = std::str::from_utf8(s.$bytes_func())
+                    .map(|s| s.to_string())
+                    .ok();
+                Some(ScalarValue::Utf8(s))
+            }
+            // type not supported yet
+            ParquetStatistics::FixedLenByteArray(_) => None,
+        }
+    }};
+}
+
+// Extract the min or max value calling `func` or `bytes_func` on the ParquetStatistics as appropriate
+macro_rules! get_min_max_values {
+    ($self:expr, $column:expr, $func:ident, $bytes_func:ident) => {{
+        let (column_index, field) = if let Some((v, f)) = $self.parquet_schema.column_with_name($column) {
+            (v, f)
+        } else {
+            // Named column was not present
+            return None
+        };
+
+        let data_type = field.data_type();
+        let null_scalar: ScalarValue = if let Ok(v) = data_type.try_into() {
+            v
+        } else {
+            // DataFusion doesn't have support for ScalarValues of the column type
+            return None
+        };
+
+        let scalar_values : Vec<ScalarValue> = $self.row_group_metadata
+            .iter()
+            .flat_map(|meta| {
+                meta.column(column_index).statistics()
+            })
+            .map(|stats| {
+                get_statistic!(stats, $func, $bytes_func)
+            })
+            .map(|maybe_scalar| {
+                // column either did't have statistics at all or didn't have min/max values
+                maybe_scalar.unwrap_or_else(|| null_scalar.clone())
+            })
+            .collect();

Review comment:
       I tried to avoid the collect() but I couldn't get Rust to stop complaining about returning a reference to a local value :(




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426#discussion_r641425690



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -457,11 +461,102 @@ fn send_result(
     Ok(())
 }
 
+/// Wraps parquet statistics in a way
+/// that implements [`PruningStatistics`]
+struct RowGroupPruningStatistics<'a> {
+    row_group_metadata: &'a [RowGroupMetaData],
+    parquet_schema: &'a Schema,
+}
+
+/// Extract the min/max statistics from a `ParquetStatistics` object
+macro_rules! get_statistic {
+    ($column_statistics:expr, $func:ident, $bytes_func:ident) => {{
+        if !$column_statistics.has_min_max_set() {
+            return None;
+        }
+        match $column_statistics {
+            ParquetStatistics::Boolean(s) => Some(ScalarValue::Boolean(Some(*s.$func()))),
+            ParquetStatistics::Int32(s) => Some(ScalarValue::Int32(Some(*s.$func()))),
+            ParquetStatistics::Int64(s) => Some(ScalarValue::Int64(Some(*s.$func()))),
+            // 96 bit ints not supported
+            ParquetStatistics::Int96(_) => None,
+            ParquetStatistics::Float(s) => Some(ScalarValue::Float32(Some(*s.$func()))),
+            ParquetStatistics::Double(s) => Some(ScalarValue::Float64(Some(*s.$func()))),
+            ParquetStatistics::ByteArray(s) => {
+                let s = std::str::from_utf8(s.$bytes_func())
+                    .map(|s| s.to_string())
+                    .ok();
+                Some(ScalarValue::Utf8(s))
+            }
+            // type not supported yet
+            ParquetStatistics::FixedLenByteArray(_) => None,
+        }
+    }};
+}
+
+// Extract the min or max value calling `func` or `bytes_func` on the ParquetStatistics as appropriate
+macro_rules! get_min_max_values {
+    ($self:expr, $column:expr, $func:ident, $bytes_func:ident) => {{
+        let (column_index, field) = if let Some((v, f)) = $self.parquet_schema.column_with_name($column) {
+            (v, f)
+        } else {
+            // Named column was not present
+            return None
+        };
+
+        let data_type = field.data_type();
+        let null_scalar: ScalarValue = if let Ok(v) = data_type.try_into() {
+            v
+        } else {
+            // DataFusion doesn't have support for ScalarValues of the column type
+            return None
+        };
+
+        let scalar_values : Vec<ScalarValue> = $self.row_group_metadata
+            .iter()
+            .flat_map(|meta| {
+                meta.column(column_index).statistics()
+            })
+            .map(|stats| {
+                get_statistic!(stats, $func, $bytes_func)
+            })
+            .map(|maybe_scalar| {
+                // column either did't have statistics at all or didn't have min/max values
+                maybe_scalar.unwrap_or_else(|| null_scalar.clone())
+            })
+            .collect();

Review comment:
       I agree -- 😢 🐼 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426#discussion_r639100976



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -457,11 +461,102 @@ fn send_result(
     Ok(())
 }
 
+/// Wraps parquet statistics in a way
+/// that implements [`PruningStatistics`]
+struct RowGroupPruningStatistics<'a> {
+    row_group_metadata: &'a [RowGroupMetaData],
+    parquet_schema: &'a Schema,
+}
+
+/// Extract the min/max statistics from a `ParquetStatistics` object
+macro_rules! get_statistic {
+    ($column_statistics:expr, $func:ident, $bytes_func:ident) => {{
+        if !$column_statistics.has_min_max_set() {
+            return None;
+        }
+        match $column_statistics {
+            ParquetStatistics::Boolean(s) => Some(ScalarValue::Boolean(Some(*s.$func()))),
+            ParquetStatistics::Int32(s) => Some(ScalarValue::Int32(Some(*s.$func()))),
+            ParquetStatistics::Int64(s) => Some(ScalarValue::Int64(Some(*s.$func()))),
+            // 96 bit ints not supported
+            ParquetStatistics::Int96(_) => None,
+            ParquetStatistics::Float(s) => Some(ScalarValue::Float32(Some(*s.$func()))),
+            ParquetStatistics::Double(s) => Some(ScalarValue::Float64(Some(*s.$func()))),
+            ParquetStatistics::ByteArray(s) => {
+                let s = std::str::from_utf8(s.$bytes_func())
+                    .map(|s| s.to_string())
+                    .ok();
+                Some(ScalarValue::Utf8(s))
+            }
+            // type not supported yet
+            ParquetStatistics::FixedLenByteArray(_) => None,
+        }
+    }};
+}
+
+// Extract the min or max value calling `func` or `bytes_func` on the ParquetStatistics as appropriate
+macro_rules! get_min_max_values {
+    ($self:expr, $column:expr, $func:ident, $bytes_func:ident) => {{
+        let (column_index, field) = if let Some((v, f)) = $self.parquet_schema.column_with_name($column) {
+            (v, f)
+        } else {
+            // Named column was not present
+            return None
+        };
+
+        let data_type = field.data_type();
+        let null_scalar: ScalarValue = if let Ok(v) = data_type.try_into() {
+            v
+        } else {
+            // DataFusion doesn't have support for ScalarValues of the column type
+            return None
+        };
+
+        let scalar_values : Vec<ScalarValue> = $self.row_group_metadata
+            .iter()
+            .flat_map(|meta| {
+                meta.column(column_index).statistics()
+            })
+            .map(|stats| {
+                get_statistic!(stats, $func, $bytes_func)
+            })
+            .map(|maybe_scalar| {
+                // column either did't have statistics at all or didn't have min/max values
+                maybe_scalar.unwrap_or_else(|| null_scalar.clone())
+            })
+            .collect();

Review comment:
       Makes sense -- I was trying so hard to avoid having to pass in the owned `ScalarValue` -- and instead it got even worse -- callers have to `collect()` a bunch of owned ones all together!
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426#issuecomment-848051797


   > Yap, makes a lot of sense and follows the arrow spirit much better ^_^
   
   Yes, I am happy with how this turned out (thanks to @Dandandan  for suggesting the change 💯 )


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #426: Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #426:
URL: https://github.com/apache/arrow-datafusion/pull/426#discussion_r640990959



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -457,11 +461,102 @@ fn send_result(
     Ok(())
 }
 
+/// Wraps parquet statistics in a way
+/// that implements [`PruningStatistics`]
+struct RowGroupPruningStatistics<'a> {
+    row_group_metadata: &'a [RowGroupMetaData],
+    parquet_schema: &'a Schema,
+}
+
+/// Extract the min/max statistics from a `ParquetStatistics` object
+macro_rules! get_statistic {
+    ($column_statistics:expr, $func:ident, $bytes_func:ident) => {{
+        if !$column_statistics.has_min_max_set() {
+            return None;
+        }
+        match $column_statistics {
+            ParquetStatistics::Boolean(s) => Some(ScalarValue::Boolean(Some(*s.$func()))),
+            ParquetStatistics::Int32(s) => Some(ScalarValue::Int32(Some(*s.$func()))),
+            ParquetStatistics::Int64(s) => Some(ScalarValue::Int64(Some(*s.$func()))),
+            // 96 bit ints not supported
+            ParquetStatistics::Int96(_) => None,
+            ParquetStatistics::Float(s) => Some(ScalarValue::Float32(Some(*s.$func()))),
+            ParquetStatistics::Double(s) => Some(ScalarValue::Float64(Some(*s.$func()))),
+            ParquetStatistics::ByteArray(s) => {
+                let s = std::str::from_utf8(s.$bytes_func())
+                    .map(|s| s.to_string())
+                    .ok();
+                Some(ScalarValue::Utf8(s))
+            }
+            // type not supported yet
+            ParquetStatistics::FixedLenByteArray(_) => None,
+        }
+    }};
+}
+
+// Extract the min or max value calling `func` or `bytes_func` on the ParquetStatistics as appropriate
+macro_rules! get_min_max_values {
+    ($self:expr, $column:expr, $func:ident, $bytes_func:ident) => {{
+        let (column_index, field) = if let Some((v, f)) = $self.parquet_schema.column_with_name($column) {
+            (v, f)
+        } else {
+            // Named column was not present
+            return None
+        };
+
+        let data_type = field.data_type();
+        let null_scalar: ScalarValue = if let Ok(v) = data_type.try_into() {
+            v
+        } else {
+            // DataFusion doesn't have support for ScalarValues of the column type
+            return None
+        };
+
+        let scalar_values : Vec<ScalarValue> = $self.row_group_metadata
+            .iter()
+            .flat_map(|meta| {
+                meta.column(column_index).statistics()
+            })
+            .map(|stats| {
+                get_statistic!(stats, $func, $bytes_func)
+            })
+            .map(|maybe_scalar| {
+                // column either did't have statistics at all or didn't have min/max values
+                maybe_scalar.unwrap_or_else(|| null_scalar.clone())
+            })
+            .collect();

Review comment:
       Hm that's sad... I think there should be somewhere a violation somewhere of using the trusted length iterator (without requiring `unsafe`).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org