You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "crepererum (via GitHub)" <gi...@apache.org> on 2023/02/24 10:26:12 UTC

[GitHub] [arrow-datafusion] crepererum opened a new pull request, #5386: refactor: parquet pruning simplifications

crepererum opened a new pull request, #5386:
URL: https://github.com/apache/arrow-datafusion/pull/5386

   # Which issue does this PR close?
   Prep work for #4695.
   
   # Rationale for this change
   Makes the actual change in #4695 easier.
   
   # What changes are included in this PR?
   Internal improvements.
   
   # Are these changes tested?
   Existing tests pass, no function changes.
   
   # Are there any user-facing changes?
   \-


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio merged pull request #5386: refactor: parquet pruning simplifications

Posted by "avantgardnerio (via GitHub)" <gi...@apache.org>.
avantgardnerio merged PR #5386:
URL: https://github.com/apache/arrow-datafusion/pull/5386


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] crepererum commented on a diff in pull request #5386: refactor: parquet pruning simplifications

Posted by "crepererum (via GitHub)" <gi...@apache.org>.
crepererum commented on code in PR #5386:
URL: https://github.com/apache/arrow-datafusion/pull/5386#discussion_r1117927123


##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -721,23 +730,20 @@ fn build_predicate_expression(
     let (left, op, right) = match expr {
         Expr::BinaryExpr(BinaryExpr { left, op, right }) => (left, *op, right),
         Expr::IsNull(expr) => {
-            let expr = build_is_null_column_expr(expr, schema, required_columns)
+            return build_is_null_column_expr(expr, schema, required_columns)
                 .unwrap_or(unhandled);
-            return Ok(expr);

Review Comment:
   Yes 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ursabot commented on pull request #5386: refactor: parquet pruning simplifications

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #5386:
URL: https://github.com/apache/arrow-datafusion/pull/5386#issuecomment-1444482405

   Benchmark runs are scheduled for baseline = 722490121cdae8551faf07aeec53dd58655cde1c and contender = 1841736d95e7fa1af6f9241eb65a27e98f458518. 1841736d95e7fa1af6f9241eb65a27e98f458518 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/0fd7be0462a343eea1cb60a459823d9a...1050d4a63a4442aa8aec9ad6be7809d1/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/0fc3d6ded38c4988bd7f5c3f0f36cea8...ab4a657bed424170b535ee7d17509ae6/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/728f35ef703d4adea1108512ef32da88...60495f19c53a4aafafe87e24bf2c15a2/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/9cb73593b03844979443fa52faf4ee17...45b54b6c02464d21a47f3ccd72fb8371/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] crepererum commented on a diff in pull request #5386: refactor: parquet pruning simplifications

Posted by "crepererum (via GitHub)" <gi...@apache.org>.
crepererum commented on code in PR #5386:
URL: https://github.com/apache/arrow-datafusion/pull/5386#discussion_r1117927548


##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -110,14 +110,16 @@ impl PagePruningPredicate {
     pub fn try_new(expr: &Expr, schema: SchemaRef) -> Result<Self> {
         let predicates = split_conjunction(expr)
             .into_iter()
-            .filter_map(|predicate| match predicate.to_columns() {
-                Ok(columns) if columns.len() == 1 => {
-                    match PruningPredicate::try_new(predicate.clone(), schema.clone()) {
-                        Ok(p) if !p.allways_true() => Some(Ok(p)),
-                        _ => None,
+            .filter_map(|predicate| {
+                match PruningPredicate::try_new(predicate.clone(), schema.clone()) {
+                    Ok(p)
+                        if (!p.allways_true())
+                            && (p.required_columns().n_columns() < 2) =>

Review Comment:
   Yeah, we skip the predicate if we don't refer to any column. However you might be right (at least this is how I read your comment) that we need additional test coverage for a "constant" predicate (i.e. one that doesn't reference any column). I'll check next week if such a test exists and if not, add one. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] crepererum commented on pull request #5386: refactor: parquet pruning simplifications

Posted by "crepererum (via GitHub)" <gi...@apache.org>.
crepererum commented on PR #5386:
URL: https://github.com/apache/arrow-datafusion/pull/5386#issuecomment-1443496116

   I am pretty sure that this test breakage is unrelated, seems flaky to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on a diff in pull request #5386: refactor: parquet pruning simplifications

Posted by "avantgardnerio (via GitHub)" <gi...@apache.org>.
avantgardnerio commented on code in PR #5386:
URL: https://github.com/apache/arrow-datafusion/pull/5386#discussion_r1117622609


##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -721,23 +730,20 @@ fn build_predicate_expression(
     let (left, op, right) = match expr {
         Expr::BinaryExpr(BinaryExpr { left, op, right }) => (left, *op, right),
         Expr::IsNull(expr) => {
-            let expr = build_is_null_column_expr(expr, schema, required_columns)
+            return build_is_null_column_expr(expr, schema, required_columns)
                 .unwrap_or(unhandled);
-            return Ok(expr);

Review Comment:
   Oh, so this was _always_ returning Ok?



##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -110,14 +110,16 @@ impl PagePruningPredicate {
     pub fn try_new(expr: &Expr, schema: SchemaRef) -> Result<Self> {
         let predicates = split_conjunction(expr)
             .into_iter()
-            .filter_map(|predicate| match predicate.to_columns() {
-                Ok(columns) if columns.len() == 1 => {
-                    match PruningPredicate::try_new(predicate.clone(), schema.clone()) {
-                        Ok(p) if !p.allways_true() => Some(Ok(p)),
-                        _ => None,
+            .filter_map(|predicate| {
+                match PruningPredicate::try_new(predicate.clone(), schema.clone()) {
+                    Ok(p)
+                        if (!p.allways_true())
+                            && (p.required_columns().n_columns() < 2) =>

Review Comment:
   This is a behavior change for `n_columns() == 0`. Based on:
   
   ```
       pub fn allways_true(&self) -> bool {
           self.predicate_expr
               .as_any()
               .downcast_ref::<Literal>()
               .map(|l| matches!(l.value(), ScalarValue::Boolean(Some(true))))
               .unwrap_or_default()
       }
   ```
   
   I ran the test suite, panicing if `n_columns() == 0` and I can't get it to happen, so I guess it LGTM.
   I assume that would default to false, in which case I think we'd want to return a `None` here?



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -258,6 +259,14 @@ impl RequiredStatColumns {
         Self::default()
     }
 
+    /// Returns number of unique columns.
+    pub(crate) fn n_columns(&self) -> usize {
+        self.iter()
+            .map(|(c, _s, _f)| c)

Review Comment:
   More descriptive variable names would help readability here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org