You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/27 19:35:49 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #3989: Add parquet predicate pushdown metrics

alamb opened a new pull request, #3989:
URL: https://github.com/apache/arrow-datafusion/pull/3989

   # Which issue does this PR close?
   
   Part of https://github.com/apache/arrow-datafusion/issues/3463
   
   
   
   # Rationale for this change
   
   I am trying to verify the correctness and efficiency of parquet predicate pushdown, so that we can turn it on in datafusion by default. Thus I want metrics telling me how many rows were pruned as well as how long it took.
   
   # What changes are included in this PR?
   
   1. Adds metric for how many rows were pruned using predicate pushdown
   2. Add metric for how long the pruning took
   3. Add tests
   
   # Are there any user-facing changes?
   New metrics: 
   
   TODO: demonstrate ising DataFusion CLI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb merged pull request #3989: Add parquet predicate pushdown metrics

Posted by GitBox <gi...@apache.org>.

alamb merged PR #3989:
URL: https://github.com/apache/arrow-datafusion/pull/3989


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] ursabot commented on pull request #3989: Add parquet predicate pushdown metrics

Posted by GitBox <gi...@apache.org>.

ursabot commented on PR #3989:
URL: https://github.com/apache/arrow-datafusion/pull/3989#issuecomment-1296230282

   Benchmark runs are scheduled for baseline = 71f05a3175e931943276a8778162e37504e00503 and contender = afc299a48b6c2438643d0c84c408ba104424dbd4. afc299a48b6c2438643d0c84c408ba104424dbd4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/9fdebdc77f264af6aeb5acc2600526d6...574027ff165a472d9b757534b4e80c99/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/48ae9298657f45089d9b8e3fa9ad3fe2...72e2e75677ab4a13814ef67ba4c4e745/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/722e1871bde7427fb7b43f7b4168d439...95fd46fa889547b09304704748039c2e/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/47ef0c77ffbf449e9c765f3f86dfdbbb...a6e700e25df940efabc076a1ee920225/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #3989: Add parquet predicate pushdown metrics

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #3989:
URL: https://github.com/apache/arrow-datafusion/pull/3989#discussion_r1007369403


##########
datafusion/core/src/physical_plan/file_format/row_filter.rs:
##########
@@ -98,14 +108,22 @@ impl ArrowPredicate for DatafusionArrowPredicate {
     }
 
     fn evaluate(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray> {
+        // scoped timer updates on drop
+        let mut timer = self.time.timer();
         match self
             .physical_expr
             .evaluate(&batch)
             .map(|v| v.into_array(batch.num_rows()))
         {
             Ok(array) => {
                 if let Some(mask) = array.as_any().downcast_ref::<BooleanArray>() {
-                    Ok(BooleanArray::from(mask.data().clone()))
+                    let bool_arr = BooleanArray::from(mask.data().clone());
+                    // TODO is there a more efficient way to count the rows that are filtered?

Review Comment:
   https://github.com/apache/arrow-rs/pull/2957
   
   Could copy paste for now, it isn't hugely complicated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #3989: Add parquet predicate pushdown metrics

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on PR #3989:
URL: https://github.com/apache/arrow-datafusion/pull/3989#issuecomment-1294391360

   > cc @Ted-Jiang I am thinking we can use a similar approach to validate / verify the PageIndex pruning you are working on
   
   Sounds great!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #3989: Add parquet predicate pushdown metrics

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #3989:
URL: https://github.com/apache/arrow-datafusion/pull/3989#issuecomment-1295053781

   > LGTM, I think if there any more than one predicates in one query, should we record the each predicate's input records count to calculate the efficiency 🤔 So could guide the user do the rearrangement.
   
   It is a great idea -- filed https://github.com/apache/arrow-datafusion/issues/3998 and I will work on that next


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #3989: Add parquet predicate pushdown metrics

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #3989:
URL: https://github.com/apache/arrow-datafusion/pull/3989#discussion_r1007331972


##########
datafusion/core/src/physical_plan/file_format/row_filter.rs:
##########
@@ -98,14 +108,22 @@ impl ArrowPredicate for DatafusionArrowPredicate {
     }
 
     fn evaluate(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray> {
+        // scoped timer updates on drop
+        let mut timer = self.time.timer();
         match self
             .physical_expr
             .evaluate(&batch)
             .map(|v| v.into_array(batch.num_rows()))
         {
             Ok(array) => {
                 if let Some(mask) = array.as_any().downcast_ref::<BooleanArray>() {
-                    Ok(BooleanArray::from(mask.data().clone()))
+                    let bool_arr = BooleanArray::from(mask.data().clone());
+                    // TODO is there a more efficient way to count the rows that are filtered?

Review Comment:
   If you don't care about nulls
   
   ```
   bool_arr.values().count_set_bits_offset(self.offset(), self.len())
   ```
   
   If you do care about nulls it is slightly more complicated, I'll get something into arrow-rs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3989: Add parquet predicate pushdown metrics

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #3989:
URL: https://github.com/apache/arrow-datafusion/pull/3989#discussion_r1007328633


##########
datafusion/core/src/physical_plan/file_format/row_filter.rs:
##########
@@ -98,14 +108,22 @@ impl ArrowPredicate for DatafusionArrowPredicate {
     }
 
     fn evaluate(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray> {
+        // scoped timer updates on drop
+        let mut timer = self.time.timer();
         match self
             .physical_expr
             .evaluate(&batch)
             .map(|v| v.into_array(batch.num_rows()))
         {
             Ok(array) => {
                 if let Some(mask) = array.as_any().downcast_ref::<BooleanArray>() {
-                    Ok(BooleanArray::from(mask.data().clone()))
+                    let bool_arr = BooleanArray::from(mask.data().clone());
+                    // TODO is there a more efficient way to count the rows that are filtered?

Review Comment:
   @tustvold  do you have any suggestions on how to count the number of true values in a boolean array faster/better than this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #3989: Add parquet predicate pushdown metrics

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #3989:
URL: https://github.com/apache/arrow-datafusion/pull/3989#issuecomment-1296227115

   I am not sure when I will have time to add per-predicate metrics -- I'll see how my other projects go.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #3989: Add parquet predicate pushdown metrics

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #3989:
URL: https://github.com/apache/arrow-datafusion/pull/3989#issuecomment-1295236202

   I am going to merge this in shortly as I have several other PRs https://github.com/apache/arrow-datafusion/pull/3976 and one in IOx that depend on it, unless there are objections


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3989: Add parquet predicate pushdown metrics

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #3989:
URL: https://github.com/apache/arrow-datafusion/pull/3989#discussion_r1007371564


##########
datafusion/core/src/physical_plan/file_format/row_filter.rs:
##########
@@ -98,14 +108,22 @@ impl ArrowPredicate for DatafusionArrowPredicate {
     }
 
     fn evaluate(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray> {
+        // scoped timer updates on drop
+        let mut timer = self.time.timer();
         match self
             .physical_expr
             .evaluate(&batch)
             .map(|v| v.into_array(batch.num_rows()))
         {
             Ok(array) => {
                 if let Some(mask) = array.as_any().downcast_ref::<BooleanArray>() {
-                    Ok(BooleanArray::from(mask.data().clone()))
+                    let bool_arr = BooleanArray::from(mask.data().clone());
+                    // TODO is there a more efficient way to count the rows that are filtered?

Review Comment:
   I do care about nulls, sadly, -- it needs to be Non-null and true.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #3989: Add parquet predicate pushdown metrics

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #3989:
URL: https://github.com/apache/arrow-datafusion/pull/3989#issuecomment-1294022370

   cc @Ted-Jiang  I am thinking we can use a similar approach to validate / verify the PageIndex pruning you are working on 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3989: Add parquet predicate pushdown metrics

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #3989:
URL: https://github.com/apache/arrow-datafusion/pull/3989#discussion_r1007371815


##########
datafusion/core/src/physical_plan/file_format/row_filter.rs:
##########
@@ -98,14 +108,22 @@ impl ArrowPredicate for DatafusionArrowPredicate {
     }
 
     fn evaluate(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray> {
+        // scoped timer updates on drop
+        let mut timer = self.time.timer();
         match self
             .physical_expr
             .evaluate(&batch)
             .map(|v| v.into_array(batch.num_rows()))
         {
             Ok(array) => {
                 if let Some(mask) = array.as_any().downcast_ref::<BooleanArray>() {
-                    Ok(BooleanArray::from(mask.data().clone()))
+                    let bool_arr = BooleanArray::from(mask.data().clone());
+                    // TODO is there a more efficient way to count the rows that are filtered?

Review Comment:
   I can file a ticket in arrow-rs if that would be helpful



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3989: Add parquet predicate pushdown metrics

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #3989:
URL: https://github.com/apache/arrow-datafusion/pull/3989#discussion_r1008096522


##########
datafusion/core/src/physical_plan/file_format/row_filter.rs:
##########
@@ -98,14 +108,22 @@ impl ArrowPredicate for DatafusionArrowPredicate {
     }
 
     fn evaluate(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray> {
+        // scoped timer updates on drop
+        let mut timer = self.time.timer();
         match self
             .physical_expr
             .evaluate(&batch)
             .map(|v| v.into_array(batch.num_rows()))
         {
             Ok(array) => {
                 if let Some(mask) = array.as_any().downcast_ref::<BooleanArray>() {
-                    Ok(BooleanArray::from(mask.data().clone()))
+                    let bool_arr = BooleanArray::from(mask.data().clone());
+                    // TODO is there a more efficient way to count the rows that are filtered?

Review Comment:
   Thanks again @tustvold  -- I filed https://github.com/apache/arrow-rs/issues/2963  and will copy/paste your implementation here for nwo



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org