You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "yjshen (via GitHub)" <gi...@apache.org> on 2023/04/30 02:41:16 UTC

[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #6131: Add bench.sh script to automate benchmarking DataFusion against itself

yjshen commented on code in PR #6131:
URL: https://github.com/apache/arrow-datafusion/pull/6131#discussion_r1181154338


##########
benchmarks/README.md:
##########
@@ -19,29 +19,139 @@
 
 # DataFusion Benchmarks
 
-This crate contains benchmarks based on popular public data sets and open source benchmark suites, making it easy to
-run real-world benchmarks to help with performance and scalability testing and for comparing performance with other Arrow
-implementations as well as other query engines.
+This crate contains benchmarks based on popular public data sets and
+open source benchmark suites, making it easy to run more realistic
+benchmarks to help with performance and scalability testing of DataFusion.
 
-## Benchmark derived from TPC-H
+# Benchmarks Against Other Engines
 
-These benchmarks are derived from the [TPC-H][1] benchmark. And we use this repo as the source of tpch-gen and answers:
-https://github.com/databricks/tpch-dbgen.git, based on [2.17.1](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf) version of TPC-H.
+DataFusion is included in the benchmark setups for several popular
+benchmarks that compare performance with other engines. For example:
 
-## Generating Test Data
+* [ClickBench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion)
+* [H2o.ai `db-benchmark`] scripts are in [db-benchmark](db-benchmark) directory
 
-TPC-H data can be generated using the `tpch-gen.sh` script, which creates a Docker image containing the TPC-DS data
-generator.
+[ClickBench]: https://github.com/ClickHouse/ClickBench/tree/main
+[H2o.ai `db-benchmark`]: https://github.com/h2oai/db-benchmark
 
-```bash
-# scale_factor: scale of the database population. scale 1.0 represents ~1 GB of data
-./tpch-gen.sh <scale_factor>
+# Running the benchmarks
+
+## Running Benchmarks
+
+The easiest way to run benchmarks from DataFusion source checkouts is
+to use the [bench.sh](bench.sh) script. Usage instructions can be
+found with:
+
+```shell
+# show usage
+./bench.sh
+```
+
+## Generating Data
+
+You can create data for all these benchmarks using the [bench.sh](bench.sh) script:
+
+```shell
+./bench.sh data
+```
+
+Data is generated in the `data` subdirectory and will not be checked
+in because this directory has been added to the `.gitignore` file.
+
+
+## Example to compare peformance on main to a branch
+
+```shell
+git checkout main
+
+# Create the data
+./benchmarks/bench.sh data
+
+# Gather baseline data for tpch benchmark
+./benchmarks/bench.sh run tpch
+
+# Switch to the branch the branch name is mybranch and gather data

Review Comment:
   👍 I was curious before about what's the magic for comparing branches



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org