You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/03/02 20:31:05 UTC

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5433: Parquet sorting benchmark

alamb commented on code in PR #5433:
URL: https://github.com/apache/arrow-datafusion/pull/5433#discussion_r1123656890


##########
benchmarks/README.md:
##########
@@ -143,13 +145,13 @@ h2o groupby query 1 took 1669 ms
 [1]: http://www.tpc.org/tpch/
 [2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
 
-## Parquet filter pushdown benchmarks
+## Parquet benchmarks
 
-This is a set of benchmarks for testing and verifying performance of parquet filter pushdown. The queries are executed on
-a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
+This is a set of benchmarks for testing and verifying performance of parquet filtering and sorting.
+The queries are executed on a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
 
 ```base
-cargo run --release --bin parquet_filter_pushdown --  --path ./data --scale-factor 1.0
+cargo run --release --bin parquet --  --path ./data --scale-factor 1.0

Review Comment:
   🤔  I tried this command and got:
   
   ```shell
   alamb@MacBook-Pro-8:~/Software/arrow-datafusion$ CARGO_TARGET_DIR=/Users/alamb/Software/target-df cargo run --release --bin parquet -- --path ./data --scale-factor 1.0
   error: no bin target named `parquet`.
   Available bin targets:
       print_config_docs
       tpch
       h2o
       nyctaxi
   ```
   
   I think it is because the file was moved out of `benchmarks/src/bin` and into `benchmarks/parquet.rs`



##########
benchmarks/parquet.rs:
##########
@@ -84,13 +86,132 @@ async fn main() -> Result<()> {
     }
 
     let test_file = gen_data(path, opt.scale_factor, props_builder.build())?;
-
-    run_benchmarks(opt, &test_file).await?;
-
+    println!("running filter benchmarks");
+    run_filter_benchmarks(opt.clone(), &test_file).await?;
+    println!("running sort benchmarks");

Review Comment:
   what would you think about adding some way to pick which benchmark to run here. Something like
   
   Run the sort benchmark  like
   ```shell
   cargo run --release --bin parquet -- sort --path ./data --scale-factor 1.0
   ```
   
   Run the filter benchmark  like
   ```shell
   cargo run --release --bin parquet -- filter --path ./data --scale-factor 1.0
   ```



##########
benchmarks/parquet.rs:
##########
@@ -84,13 +86,132 @@ async fn main() -> Result<()> {
     }
 
     let test_file = gen_data(path, opt.scale_factor, props_builder.build())?;
-
-    run_benchmarks(opt, &test_file).await?;
-
+    println!("running filter benchmarks");
+    run_filter_benchmarks(opt.clone(), &test_file).await?;
+    println!("running sort benchmarks");
+    run_sort_benchmarks(opt, &test_file).await?;
     Ok(())
 }
 
-async fn run_benchmarks(opt: Opt, test_file: &TestParquetFile) -> Result<()> {
+async fn run_sort_benchmarks(opt: Opt, test_file: &TestParquetFile) -> Result<()> {
+    use datafusion::physical_expr::expressions::col;
+    let scan_options_matrix = vec![
+        ParquetScanOptions {
+            pushdown_filters: false,
+            reorder_filters: false,
+            enable_page_index: false,
+        },
+        ParquetScanOptions {
+            pushdown_filters: true,
+            reorder_filters: true,
+            enable_page_index: true,
+        },
+        ParquetScanOptions {
+            pushdown_filters: true,
+            reorder_filters: true,
+            enable_page_index: false,
+        },
+    ];
+    let schema = test_file.schema();
+    let sort_cases = vec![
+        (
+            "sort utf8",
+            vec![PhysicalSortExpr {
+                expr: col("request_method", &schema)?,
+                options: Default::default(),
+            }],
+        ),
+        (
+            "sort int",
+            vec![PhysicalSortExpr {
+                expr: col("request_bytes", &schema)?,
+                options: Default::default(),
+            }],
+        ),
+        (
+            "sort decimal",
+            vec![
+                // sort decimal
+                PhysicalSortExpr {
+                    expr: col("decimal_price", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+        (
+            "sort integer tuple",
+            vec![
+                PhysicalSortExpr {
+                    expr: col("request_bytes", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("response_bytes", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+        (
+            "sort utf8 tuple",
+            vec![
+                // sort utf8 tuple
+                PhysicalSortExpr {
+                    expr: col("service", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("host", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("pod", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("image", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+        (
+            "sort mixed tuple",
+            vec![
+                PhysicalSortExpr {
+                    expr: col("service", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("request_bytes", &schema)?,
+                    options: Default::default(),
+                },
+                PhysicalSortExpr {
+                    expr: col("decimal_price", &schema)?,
+                    options: Default::default(),
+                },
+            ],
+        ),
+    ];
+    for (title, expr) in sort_cases {
+        println!("Executing '{title}' (sorting by: {expr:?})");
+        for scan_options in &scan_options_matrix {

Review Comment:
   since there are no filters in this test, I don't think it is necessary to try the various different filter options -- just the different sorts would be great.



##########
benchmarks/parquet.rs:
##########
@@ -178,6 +299,22 @@ async fn exec_scan(
     Ok(result.iter().map(|b| b.num_rows()).sum())
 }
 
+async fn exec_sort(
+    ctx: &SessionContext,
+    expr: &[PhysicalSortExpr],
+    test_file: &TestParquetFile,
+    debug: bool,
+) -> Result<()> {
+    let scan = test_file.create_scan(lit(true)).await?;

Review Comment:
   I am not sure the `lit(true)` filter will be removed at this level. What do you think about making `TestParquetFile::create_scan` take `Option<Expr>` instead of passing in lit true?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org