You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "jaylmiller (via GitHub)" <gi...@apache.org> on 2023/02/28 22:31:51 UTC

[GitHub] [arrow-datafusion] jaylmiller opened a new pull request, #5433: Parquet sorting benchmark

jaylmiller opened a new pull request, #5433:
URL: https://github.com/apache/arrow-datafusion/pull/5433

   # Which issue does this PR close?
   Progress towards #5230 (does not close).
   
   # Rationale for this change
   We want to see how the row encoding changes proposed in #5230 impacts performance of sorting large parquet files.
   
   # What changes are included in this PR?
   - Added sort benchmark cases to the existing parquet benchmark (don't want to end up with even more benchmarks).
   - Since there are now multiple benchmark cases in that bin, renamed file from `parquet_filter_pushdown.rs` to `parquet.rs`,
   since it is no longer purely benchmarking the filter pushdown.
   - Updated `benchmarks/README.md` to account for said changes.
   
   # Are these changes tested?
   
   # Are there any user-facing changes?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jaylmiller commented on a diff in pull request #5433: Parquet sorting benchmark

Posted by "jaylmiller (via GitHub)" <gi...@apache.org>.

jaylmiller commented on code in PR #5433:
URL: https://github.com/apache/arrow-datafusion/pull/5433#discussion_r1123834286


##########
benchmarks/README.md:
##########
@@ -143,13 +145,13 @@ h2o groupby query 1 took 1669 ms
 [1]: http://www.tpc.org/tpch/
 [2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
 
-## Parquet filter pushdown benchmarks
+## Parquet benchmarks
 
-This is a set of benchmarks for testing and verifying performance of parquet filter pushdown. The queries are executed on
-a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
+This is a set of benchmarks for testing and verifying performance of parquet filtering and sorting.
+The queries are executed on a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
 
 ```base
-cargo run --release --bin parquet_filter_pushdown --  --path ./data --scale-factor 1.0
+cargo run --release --bin parquet --  --path ./data --scale-factor 1.0

Review Comment:
   yes that's my bad 😅



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] ursabot commented on pull request #5433: Parquet sorting benchmark

Posted by "ursabot (via GitHub)" <gi...@apache.org>.

ursabot commented on PR #5433:
URL: https://github.com/apache/arrow-datafusion/pull/5433#issuecomment-1453417828

   Benchmark runs are scheduled for baseline = d11820aa256284f9a817c7b699a548f9c3e1c399 and contender = 1455b025cec3f438f7c1a0aae3e0f1939755d84e. 1455b025cec3f438f7c1a0aae3e0f1939755d84e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/a98b1f89c8bf4f46b92da89c4777fa7f...41e651135ff643cd85690a543adaec0c/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/b6e7bc5da8c84651a243097180f80025...3a10ae995df74655b866d78cb1730fc0/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/b11334c4beb7478cbf217eb2c49b6e99...20d89ce971df4d8c88a322d1159af9e1/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/876465b9d8144c36b87578ddbec77126...92a673a3a34d46698997a5ea1b2c5704/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jaylmiller commented on a diff in pull request #5433: Parquet sorting benchmark

Posted by "jaylmiller (via GitHub)" <gi...@apache.org>.

jaylmiller commented on code in PR #5433:
URL: https://github.com/apache/arrow-datafusion/pull/5433#discussion_r1123839102


##########
benchmarks/parquet.rs:
##########
@@ -178,6 +299,22 @@ async fn exec_scan(
     Ok(result.iter().map(|b| b.num_rows()).sum())
 }
 
+async fn exec_sort(
+    ctx: &SessionContext,
+    expr: &[PhysicalSortExpr],
+    test_file: &TestParquetFile,
+    debug: bool,
+) -> Result<()> {
+    let scan = test_file.create_scan(lit(true)).await?;

Review Comment:
   It seem's like this is the only place in the codebase that is using this though, so it should be fine to change that method.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5433: Parquet sorting benchmark

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #5433:
URL: https://github.com/apache/arrow-datafusion/pull/5433#discussion_r1124346116


##########
benchmarks/README.md:
##########
@@ -176,3 +181,13 @@ Iteration 1 returned 1781686 rows in 1986 ms
 Iteration 2 returned 1781686 rows in 1947 ms
 ...
 ```
+
+Similarly, to run sorting benchmarks, run:
+
+```base
+cargo run --release --bin parquet -- sort  --path ./data --scale-factor 1.0

Review Comment:
   I tried this out and it works great! 
   
   ```
   cargo run --bin parquet -- sort  --path ./data --scale-factor 0.1
   ```



##########
parquet-test-utils/src/lib.rs:
##########
@@ -148,25 +155,26 @@ impl TestParquetFile {
         // run coercion on the filters to coerce types etc.
         let props = ExecutionProps::new();
         let context = SimplifyContext::new(&props).with_schema(df_schema.clone());
-        let simplifier = ExprSimplifier::new(context);
-        let filter = simplifier.coerce(filter, df_schema.clone()).unwrap();
-
-        let physical_filter_expr = create_physical_expr(
-            &filter,
-            &df_schema,
-            self.schema.as_ref(),
-            &ExecutionProps::default(),
-        )?;
-
-        let parquet_exec = Arc::new(ParquetExec::new(
-            scan_config,
-            Some(physical_filter_expr.clone()),
-            None,
-        ));
-
-        let exec = Arc::new(FilterExec::try_new(physical_filter_expr, parquet_exec)?);
-
-        Ok(exec)
+        if let Some(filter) = maybe_filter {

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb merged pull request #5433: Parquet sorting benchmark

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb merged PR #5433:
URL: https://github.com/apache/arrow-datafusion/pull/5433


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jaylmiller commented on a diff in pull request #5433: Parquet sorting benchmark

Posted by "jaylmiller (via GitHub)" <gi...@apache.org>.

jaylmiller commented on code in PR #5433:
URL: https://github.com/apache/arrow-datafusion/pull/5433#discussion_r1123839102


##########
benchmarks/parquet.rs:
##########
@@ -178,6 +299,22 @@ async fn exec_scan(
     Ok(result.iter().map(|b| b.num_rows()).sum())
 }
 
+async fn exec_sort(
+    ctx: &SessionContext,
+    expr: &[PhysicalSortExpr],
+    test_file: &TestParquetFile,
+    debug: bool,
+) -> Result<()> {
+    let scan = test_file.create_scan(lit(true)).await?;

Review Comment:
   Its only used in one other place in the codebase, so changing that method signature should be fine



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jaylmiller commented on a diff in pull request #5433: Parquet sorting benchmark

Posted by "jaylmiller (via GitHub)" <gi...@apache.org>.

jaylmiller commented on code in PR #5433:
URL: https://github.com/apache/arrow-datafusion/pull/5433#discussion_r1123837562


##########
benchmarks/parquet.rs:
##########
@@ -178,6 +299,22 @@ async fn exec_scan(
     Ok(result.iter().map(|b| b.num_rows()).sum())
 }
 
+async fn exec_sort(
+    ctx: &SessionContext,
+    expr: &[PhysicalSortExpr],
+    test_file: &TestParquetFile,
+    debug: bool,
+) -> Result<()> {
+    let scan = test_file.create_scan(lit(true)).await?;

Review Comment:
   Yeah that is a good solution. My initial thinking was I didn't want to change another crate (`parquet-test-utils`).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5433: Parquet sorting benchmark

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #5433:
URL: https://github.com/apache/arrow-datafusion/pull/5433#discussion_r1123656890


##########
benchmarks/README.md:
##########
@@ -143,13 +145,13 @@ h2o groupby query 1 took 1669 ms
 [1]: http://www.tpc.org/tpch/
 [2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
 
-## Parquet filter pushdown benchmarks
+## Parquet benchmarks
 
-This is a set of benchmarks for testing and verifying performance of parquet filter pushdown. The queries are executed on
-a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
+This is a set of benchmarks for testing and verifying performance of parquet filtering and sorting.
+The queries are executed on a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
 
 ```base
-cargo run --release --bin parquet_filter_pushdown --  --path ./data --scale-factor 1.0
+cargo run --release --bin parquet --  --path ./data --scale-factor 1.0

Review Comment:
   🤔  I tried this command and got:
   
   ```shell
   alamb@MacBook-Pro-8:~/Software/arrow-datafusion$ CARGO_TARGET_DIR=/Users/alamb/Software/target-df cargo run --release --bin parquet -- --path ./data --scale-factor 1.0
   error: no bin target named `parquet`.
   Available bin targets:
       print_config_docs
       tpch
       h2o
       nyctaxi
   ```
   
   I think it is because the file was moved out of `benchmarks/src/bin` and into `benchmarks/parquet.rs`



##########
benchmarks/parquet.rs:
##########
@@ -84,13 +86,132 @@ async fn main() -> Result<()> {
     }
 
     let test_file = gen_data(path, opt.scale_factor, props_builder.build())?;
-
-    run_benchmarks(opt, &test_file).await?;
-
+    println!("running filter benchmarks");
+    run_filter_benchmarks(opt.clone(), &test_file).await?;
+    println!("running sort benchmarks");

Review Comment:
   what would you think about adding some way to pick which benchmark to run here. Something like
   
   Run the sort benchmark  like
   ```shell
   cargo run --release --bin parquet -- sort --path ./data --scale-factor 1.0
   ```
   
   Run the filter benchmark  like
   ```shell
   cargo run --release --bin parquet -- filter --path ./data --scale-factor 1.0
   ```



##########
benchmarks/parquet.rs:
##########
@@ -84,13 +86,132 @@ async fn main() -> Result<()> {
     }
 
     let test_file = gen_data(path, opt.scale_factor, props_builder.build())?;
-
-    run_benchmarks(opt, &test_file).await?;
-
+    println!("running filter benchmarks");
+    run_filter_benchmarks(opt.clone(), &test_file).await?;
+    println!("running sort benchmarks");
+    run_sort_benchmarks(opt, &test_file).await?;
     Ok(())
 }
 
-async fn run_benchmarks(opt: Opt, test_file: &TestParquetFile) -> Result<()> {
+async fn run_sort_benchmarks(opt: Opt, test_file: &TestParquetFile) -> Result<()> {
+    use datafusion::physical_expr::expressions::col;
+    let scan_options_matrix = vec![
+        ParquetScanOptions {
+            pushdown_filters: false,
+            reorder_filters: false,
+            enable_page_index: false,
+        },
+        ParquetScanOptions {
+            pushdown_filters: true,
+            reorder_filters: true,
+            enable_page_index: true,
+        },
+        ParquetScanOptions {
+            pushdown_filters: true,
+            reorder_filters: true,
+            enable_page_index: false,
+        },
+    ];
+    let schema = test_file.schema();
+    let sort_cases = vec![
+        (
+            "sort utf8",
+            vec![PhysicalSortExpr {
+                expr: col("request_method", &schema)?,
+                options: Default::default(),
+            }],
+        ),
+        (
+            "sort int",
+            vec![PhysicalSortExpr {
+                expr: col("request_bytes", &schema)?,
+                options: Default::default(),
+            }],
+        ),
+        (
+            "sort decimal",
+            vec![
+                // sort decimal
+                PhysicalSortExpr {
+                    expr: col("decimal_price", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+        (
+            "sort integer tuple",
+            vec![
+                PhysicalSortExpr {
+                    expr: col("request_bytes", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("response_bytes", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+        (
+            "sort utf8 tuple",
+            vec![
+                // sort utf8 tuple
+                PhysicalSortExpr {
+                    expr: col("service", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("host", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("pod", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("image", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+        (
+            "sort mixed tuple",
+            vec![
+                PhysicalSortExpr {
+                    expr: col("service", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("request_bytes", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("decimal_price", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+    ];
+    for (title, expr) in sort_cases {
+        println!("Executing '{title}' (sorting by: {expr:?})");
+        for scan_options in &scan_options_matrix {

Review Comment:
   since there are no filters in this test, I don't think it is necessary to try the various different filter options -- just the different sorts would be great.



##########
benchmarks/parquet.rs:
##########
@@ -178,6 +299,22 @@ async fn exec_scan(
     Ok(result.iter().map(|b| b.num_rows()).sum())
 }
 
+async fn exec_sort(
+    ctx: &SessionContext,
+    expr: &[PhysicalSortExpr],
+    test_file: &TestParquetFile,
+    debug: bool,
+) -> Result<()> {
+    let scan = test_file.create_scan(lit(true)).await?;

Review Comment:
   I am not sure the `lit(true)` filter will be removed at this level. What do you think about making `TestParquetFile::create_scan` take `Option<Expr>` instead of passing in lit true?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jaylmiller commented on pull request #5433: Parquet sorting benchmark

Posted by "jaylmiller (via GitHub)" <gi...@apache.org>.

jaylmiller commented on PR #5433:
URL: https://github.com/apache/arrow-datafusion/pull/5433#issuecomment-1452742955

   Thanks @alamb agree with all these ideas 😀. PR updated accordingly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org