You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/03 11:17:24 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

alamb opened a new pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657


   This is the test harness I intend to use to validate the fix for https://github.com/apache/arrow-datafusion/issues/656 and https://github.com/apache/arrow-datafusion/issues/649. 
   
   # Rationale:
   Parquet pruning is broken (https://github.com/apache/arrow-datafusion/issues/656 see https://github.com/apache/arrow-datafusion/issues/649) but all our tests are passing. This is not good ...
   
   The test and infrastructure is pretty large itself so I wanted to get it reviewed separately prior to the actual bug fixes
   
   # Changes
   1. Add an end-to-end test for parquet pruning
   2. Add statistics to parquet reader (that are used in the tests)
   3. Add a "plan_statistics" function to gather all `SQLMetrics` after a plan has been executed
   
   I plan to extend the parquet test to cover more cases as I work on fixing the bugs
   
   # Are there any user-facing changes?
   There are some new user facing SQL metrics
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] yordan-pavlov commented on a change in pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

Posted by GitBox <gi...@apache.org>.

yordan-pavlov commented on a change in pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657#discussion_r663397093



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -555,21 +625,29 @@ fn build_row_group_predicate(
         row_group_metadata,
         parquet_schema,
     };
-
     let predicate_values = predicate_builder.prune(&pruning_stats);
 
-    let predicate_values = match predicate_values {
-        Ok(values) => values,
+    match predicate_values {
+        Ok(values) => Box::new(move |_, i| {
+            // NB: false means don't scan row group
+            if !values[i] {

Review comment:
       should the counting of filtered row groups happen in the actual filter function? what is the benefit of doing the counting inside the filter function? why not move it outside, just before the filter function is returned (similar to the error case below)?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657#discussion_r663857848



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -72,6 +76,8 @@ pub struct ParquetExec {
     batch_size: usize,
     /// Statistics for the data set (sum of statistics for all partitions)
     statistics: Statistics,
+    /// Numer of times the pruning predicate could not be created
+    predicate_creation_errors: Arc<SQLMetric>,

Review comment:
       in 595fcdd07b
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb merged pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

Posted by GitBox <gi...@apache.org>.

alamb merged pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] yordan-pavlov commented on a change in pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

Posted by GitBox <gi...@apache.org>.

yordan-pavlov commented on a change in pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657#discussion_r663396217



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -72,6 +76,8 @@ pub struct ParquetExec {
     batch_size: usize,
     /// Statistics for the data set (sum of statistics for all partitions)
     statistics: Statistics,
+    /// Numer of times the pruning predicate could not be created
+    predicate_creation_errors: Arc<SQLMetric>,

Review comment:
       would it make sense to move `predicate_creation_errors` into `ParquetMetrics`, next to the other two metrics?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657#discussion_r663851278



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -555,21 +625,29 @@ fn build_row_group_predicate(
         row_group_metadata,
         parquet_schema,
     };
-
     let predicate_values = predicate_builder.prune(&pruning_stats);
 
-    let predicate_values = match predicate_values {
-        Ok(values) => values,
+    match predicate_values {
+        Ok(values) => Box::new(move |_, i| {
+            // NB: false means don't scan row group
+            if !values[i] {

Review comment:
       One difference that results from updating the metrics right before use is what happens in a `limit` query -- ala `SELECT * from parquet_table where date < 2012-20-20 limit 10` -- in this case the  query might never even contemplate reading a particular partition due to the limit
   
   However, having written that I think it would make the statistics easier to understand (and more consistent query to query) to report the stats prior to the actual function. I will make that change




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] yordan-pavlov commented on pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

Posted by GitBox <gi...@apache.org>.

yordan-pavlov commented on pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657#issuecomment-873471346


   looks very interesting @alamb, I wonder how else these metrics could be used; could be useful for general diagnostics and troubleshooting performance of queries.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] yordan-pavlov commented on a change in pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

Posted by GitBox <gi...@apache.org>.

yordan-pavlov commented on a change in pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657#discussion_r663397093



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -555,21 +625,29 @@ fn build_row_group_predicate(
         row_group_metadata,
         parquet_schema,
     };
-
     let predicate_values = predicate_builder.prune(&pruning_stats);
 
-    let predicate_values = match predicate_values {
-        Ok(values) => values,
+    match predicate_values {
+        Ok(values) => Box::new(move |_, i| {
+            // NB: false means don't scan row group
+            if !values[i] {

Review comment:
       should the counting of filtered row groups happen in the actual filter function? what is the benefit of doing the counting inside the filter function? why not move it outside, just before the filter function is returned?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657#issuecomment-874028870


   > looks very interesting @alamb, I wonder how else these metrics could be used; could be useful for general diagnostics and troubleshooting performance of queries.
   
   
   Indeed @yordan-pavlov  - in fact I think @andygrove  is doing just that with PRs such as https://github.com/apache/arrow-datafusion/pull/662 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657#issuecomment-873392246


   FYI @houqp and @yordan-pavlov 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657#issuecomment-874899734


   No concerns! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657#issuecomment-874831424


   @Dandandan  / @yordan-pavlov / @houqp , any concerns about merging this one in?  I now have the fix PRs backed up behind this test PR so I would like to get the test in so I can get the fixes up for review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657#discussion_r663849500



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -72,6 +76,8 @@ pub struct ParquetExec {
     batch_size: usize,
     /// Statistics for the data set (sum of statistics for all partitions)
     statistics: Statistics,
+    /// Numer of times the pruning predicate could not be created
+    predicate_creation_errors: Arc<SQLMetric>,

Review comment:
       The reason I didn't do this is because the other metrics are per `ParquetPartition` (aka per each file / set of files) but this one is just one metric for the overall `ParquetExec`
   
   I agree this is confusing -- I'll make two statistics structures (`ParquetPartitionMetrics` and `ParquetMetrics`) which hopefully will make this clearer

##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -555,21 +625,29 @@ fn build_row_group_predicate(
         row_group_metadata,
         parquet_schema,
     };
-
     let predicate_values = predicate_builder.prune(&pruning_stats);
 
-    let predicate_values = match predicate_values {
-        Ok(values) => values,
+    match predicate_values {
+        Ok(values) => Box::new(move |_, i| {
+            // NB: false means don't scan row group
+            if !values[i] {

Review comment:
       One difference that results from updating the metrics right before use is what happens in a `limit` query -- ala `SELECT * from parquet_table where date < 2012-20-20 limit 10` -- in this case the  query might never even contemplate reading a particular partition due to the limit
   
   However, having written that I think it would make the statistics easier to understand (and more consistent query to query) to report the stats prior to the actual function. I will make that change

##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -72,6 +76,8 @@ pub struct ParquetExec {
     batch_size: usize,
     /// Statistics for the data set (sum of statistics for all partitions)
     statistics: Statistics,
+    /// Numer of times the pruning predicate could not be created
+    predicate_creation_errors: Arc<SQLMetric>,

Review comment:
       in 595fcdd07b
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657#issuecomment-874028870


   > looks very interesting @alamb, I wonder how else these metrics could be used; could be useful for general diagnostics and troubleshooting performance of queries.
   
   
   Indeed @yordan-pavlov  - in fact I think @andygrove  is doing just that with PRs such as https://github.com/apache/arrow-datafusion/pull/662 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657#discussion_r663353376



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -59,6 +61,8 @@ use crate::datasource::datasource::{ColumnStatistics, Statistics};
 use async_trait::async_trait;
 use futures::stream::{Stream, StreamExt};
 
+use super::SQLMetric;

Review comment:
       The changes to this file are to add metrics on the pruning (that are then used in the test)

##########
File path: datafusion/src/test/mod.rs
##########
@@ -251,11 +251,11 @@ pub fn make_timestamps() -> RecordBatch {
     let arr_names = StringArray::from(names);
 
     let schema = Schema::new(vec![
-        Field::new("nanos", arr_nanos.data_type().clone(), false),
-        Field::new("micros", arr_micros.data_type().clone(), false),
-        Field::new("millis", arr_millis.data_type().clone(), false),
-        Field::new("secs", arr_secs.data_type().clone(), false),
-        Field::new("name", arr_names.data_type().clone(), false),
+        Field::new("nanos", arr_nanos.data_type().clone(), true),

Review comment:
       This was a bug I found while using similar code for the end to end test - the actual data has `None` so the schema needs to be marked "nullable"

##########
File path: datafusion/tests/parquet_pruning.rs
##########
@@ -0,0 +1,343 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// This file contains an end to end test of parquet pruning. It writes
+// data into a parquet file and then
+use std::sync::Arc;
+
+use arrow::{
+    array::{
+        Array, StringArray, TimestampMicrosecondArray, TimestampMillisecondArray,
+        TimestampNanosecondArray, TimestampSecondArray,
+    },
+    datatypes::{Field, Schema},
+    record_batch::RecordBatch,
+    util::pretty::pretty_format_batches,
+};
+use chrono::Duration;
+use datafusion::{
+    physical_plan::{plan_metrics, SQLMetric},
+    prelude::ExecutionContext,
+};
+use hashbrown::HashMap;
+use parquet::{arrow::ArrowWriter, file::properties::WriterProperties};
+use tempfile::NamedTempFile;
+
+#[tokio::test]

Review comment:
       Here are the new end to end tests -- they make an actual parquet file and then run a query against it, validating the pruning metrics -- they currently "pass" but they show there is no actual pruning occuring




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #657: Add End-to-end test for parquet pruning + metrics for ParquetExec

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #657:
URL: https://github.com/apache/arrow-datafusion/pull/657#discussion_r663849500



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -72,6 +76,8 @@ pub struct ParquetExec {
     batch_size: usize,
     /// Statistics for the data set (sum of statistics for all partitions)
     statistics: Statistics,
+    /// Numer of times the pruning predicate could not be created
+    predicate_creation_errors: Arc<SQLMetric>,

Review comment:
       The reason I didn't do this is because the other metrics are per `ParquetPartition` (aka per each file / set of files) but this one is just one metric for the overall `ParquetExec`
   
   I agree this is confusing -- I'll make two statistics structures (`ParquetPartitionMetrics` and `ParquetMetrics`) which hopefully will make this clearer




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org