You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/01 16:27:03 UTC

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3976: Correctness integration test for parquet filter pushdown

alamb commented on code in PR #3976:
URL: https://github.com/apache/arrow-datafusion/pull/3976#discussion_r1010635689


##########
datafusion/core/tests/parquet_filter_pushdown.rs:
##########
@@ -0,0 +1,471 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! non trivial integration testing for parquet predicate pushdown
+//!
+//! Testing hints: If you run this test with --nocapture it will tell you where
+//! the generated parquet file went. You can then test it and try out various queries
+//! datafusion-cli like:
+//!
+//! ```sql
+//! create external table data stored as parquet location 'data.parquet';
+//! select * from data limit 10;
+//! ```
+
+use std::time::Instant;
+
+use arrow::compute::concat_batches;
+use arrow::record_batch::RecordBatch;
+use datafusion::logical_expr::{lit, Expr};
+use datafusion::physical_plan::collect;
+use datafusion::physical_plan::metrics::MetricsSet;
+use datafusion::prelude::{col, SessionContext};
+use datafusion_optimizer::utils::{conjunction, disjunction, split_conjunction};
+use itertools::Itertools;
+use lazy_static::lazy_static;
+use parquet_test_utils::{ParquetScanOptions, TestParquetFile};
+use tempfile::TempDir;
+use test_utils::AccessLogGenerator;
+
+/// how many rows of generated data to write to our parquet file (arbitrary)
+const NUM_ROWS: usize = 53819;
+
+// Only create the parquet file once as it is fairly large
+lazy_static! {
+    static ref TEMPDIR: TempDir = TempDir::new().unwrap();
+
+    /// Randomly (but consistently) generated test file. You can use `datafusion-cli` to explore it more carefully
+    static ref TESTFILE: TestParquetFile = {
+        let generator = AccessLogGenerator::new()
+            .with_row_limit(Some(NUM_ROWS));
+
+        // TODO: set the max page rows with some various / arbitrary sizes 8311
+        // (using https://github.com/apache/arrow-rs/issues/2941) to ensure we get multiple pages
+        let page_size = None;
+        let row_group_size = None;
+        let file = TEMPDIR.path().join("data.parquet");
+
+        let start = Instant::now();
+        println!("Writing test data to {:?}", file);
+        match TestParquetFile::try_new(file, generator, page_size, row_group_size) {
+            Err(e) => {
+                panic!("Error writing data: {}", e);
+            }
+            Ok(f) => {
+                println!("Completed generating test data in {:?}", Instant::now() - start);
+                f
+            }
+        }
+    };
+}
+
+#[cfg(not(target_family = "windows"))]
+#[tokio::test]
+async fn selective() {
+    TestCase::new()
+        // request_method = 'GET'
+        .with_filter(col("request_method").eq(lit("GET")))
+        .with_pushdown_expected(PushdownExpected::Some)
+        .with_expected_rows(8886)
+        .run()
+        .await
+}
+
+#[cfg(not(target_family = "windows"))]

Review Comment:
   I am skipping windows for now as these tests fail with some sort of path issue I don't have time to debug. Since there is no logic difference on linux/windows/mac there is no reason this also needs to be run on windows
   
   Example failure:
   https://github.com/apache/arrow-datafusion/actions/runs/3369837748/jobs/5589984001
   
   
   ```
   
   
   ---- basic_conjunction stdout ----
   Writing test data to "C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\.tmpDGlfus\\data.parquet"
   Generated test dataset with 53819 rows
   thread 'basic_conjunction' panicked at 'Error writing data: IO error: The filename, directory name, or volume label syntax is incorrect. (os error 123)', datafusion\core\tests\parquet_filter_pushdown.rs:66:17
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   ```



##########
datafusion/core/tests/parquet_filter_pushdown.rs:
##########
@@ -0,0 +1,471 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! non trivial integration testing for parquet predicate pushdown
+//!
+//! Testing hints: If you run this test with --nocapture it will tell you where
+//! the generated parquet file went. You can then test it and try out various queries
+//! datafusion-cli like:
+//!
+//! ```sql
+//! create external table data stored as parquet location 'data.parquet';
+//! select * from data limit 10;
+//! ```
+
+use std::time::Instant;
+
+use arrow::compute::concat_batches;
+use arrow::record_batch::RecordBatch;
+use datafusion::logical_expr::{lit, Expr};
+use datafusion::physical_plan::collect;
+use datafusion::physical_plan::metrics::MetricsSet;
+use datafusion::prelude::{col, SessionContext};
+use datafusion_optimizer::utils::{conjunction, disjunction, split_conjunction};
+use itertools::Itertools;
+use lazy_static::lazy_static;
+use parquet_test_utils::{ParquetScanOptions, TestParquetFile};
+use tempfile::TempDir;
+use test_utils::AccessLogGenerator;
+
+/// how many rows of generated data to write to our parquet file (arbitrary)
+const NUM_ROWS: usize = 53819;
+
+// Only create the parquet file once as it is fairly large
+lazy_static! {
+    static ref TEMPDIR: TempDir = TempDir::new().unwrap();
+
+    /// Randomly (but consistently) generated test file. You can use `datafusion-cli` to explore it more carefully
+    static ref TESTFILE: TestParquetFile = {
+        let generator = AccessLogGenerator::new()
+            .with_row_limit(Some(NUM_ROWS));
+
+        // TODO: set the max page rows with some various / arbitrary sizes 8311
+        // (using https://github.com/apache/arrow-rs/issues/2941) to ensure we get multiple pages
+        let page_size = None;
+        let row_group_size = None;
+        let file = TEMPDIR.path().join("data.parquet");
+
+        let start = Instant::now();
+        println!("Writing test data to {:?}", file);
+        match TestParquetFile::try_new(file, generator, page_size, row_group_size) {
+            Err(e) => {
+                panic!("Error writing data: {}", e);
+            }
+            Ok(f) => {
+                println!("Completed generating test data in {:?}", Instant::now() - start);
+                f
+            }
+        }
+    };
+}
+
+#[cfg(not(target_family = "windows"))]
+#[tokio::test]
+async fn selective() {
+    TestCase::new()
+        // request_method = 'GET'
+        .with_filter(col("request_method").eq(lit("GET")))
+        .with_pushdown_expected(PushdownExpected::Some)
+        .with_expected_rows(8886)
+        .run()
+        .await
+}
+
+#[cfg(not(target_family = "windows"))]

Review Comment:
   I am skipping windows for now as these tests fail with some sort of path issue I don't have time to debug. Since there is no logic difference on linux/windows/mac there is no reason this also needs to be run on windows
   
   Example failure:
   https://github.com/apache/arrow-datafusion/actions/runs/3369837748/jobs/5589984001
   
   
   ```
   ---- basic_conjunction stdout ----
   Writing test data to "C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\.tmpDGlfus\\data.parquet"
   Generated test dataset with 53819 rows
   thread 'basic_conjunction' panicked at 'Error writing data: IO error: The filename, directory name, or volume label syntax is incorrect. (os error 123)', datafusion\core\tests\parquet_filter_pushdown.rs:66:17
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org