Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/04/26 20:29:06 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #6131: Add bench script to benchmark datafusion against itself

alamb opened a new pull request, #6131:
URL: https://github.com/apache/arrow-datafusion/pull/6131

   # Which issue does this PR close?
   Closes https://github.com/apache/arrow-datafusion/issues/6127
   
   # Rationale for this change
   See https://github.com/apache/arrow-datafusion/issues/6127
   
   # What changes are included in this PR?
   
   - [x] Add a bench.sh script that orchestrates creating the data files and running the benchmarks
   - [ ] Update benchmark documentation
   - [ ] Remove outdated tpch_dbgen.sh script
   
   # Are these changes tested?
   
   I tested them manually on an x86 mac and a Linux x86 machine.
   
   # Are there any user-facing changes?
   
   No, these are just development scripts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #6131: Add bench.sh script to benchmark DataFusion against itself

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #6131:
URL: https://github.com/apache/arrow-datafusion/pull/6131#issuecomment-1524050533

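   Some interesting results already -- I ran a quick experiment to see how much link-time optimization (`lto`) helps. The answer is "quite a bit": up to 1.35x faster on `tpch` (Query 2) and up to 1.20x faster on `tpch_mem` (Query 14), with a single query (Query 4 on `tpch_mem`) coming out 1.05x slower.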
   
   ```
   alamb@aal-dev:~/arrow-datafusion2/benchmarks$ python compare.py results/alamb_bench/tpch.json results/alamb_bench_compare/tpch.json
   ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
   ┃ Query        ┃ /home/alamb… ┃ /home/alamb… ┃        Change ┃
   ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
   │ QQuery 1     │    1269.30ms │    1097.60ms │ +1.16x faster │
   │ QQuery 2     │     418.17ms │     309.01ms │ +1.35x faster │
   │ QQuery 3     │     393.15ms │     365.30ms │ +1.08x faster │
   │ QQuery 4     │     212.83ms │     214.36ms │     no change │
   │ QQuery 5     │     534.56ms │     531.17ms │     no change │
   │ QQuery 6     │     209.39ms │     184.41ms │ +1.14x faster │
   │ QQuery 7     │    1037.82ms │     981.81ms │ +1.06x faster │
   │ QQuery 8     │     550.99ms │     540.95ms │     no change │
   │ QQuery 9     │     982.00ms │     984.53ms │     no change │
   │ QQuery 10    │     613.48ms │     560.14ms │ +1.10x faster │
   │ QQuery 11    │     272.45ms │     231.46ms │ +1.18x faster │
   │ QQuery 12    │     319.91ms │     320.54ms │     no change │
   │ QQuery 13    │    1127.46ms │    1087.70ms │     no change │
   │ QQuery 14    │     286.89ms │     263.16ms │ +1.09x faster │
   │ QQuery 15    │     255.63ms │     233.42ms │ +1.10x faster │
   │ QQuery 16    │     302.94ms │     309.25ms │     no change │
   │ QQuery 17    │    2891.05ms │    2628.59ms │ +1.10x faster │
   │ QQuery 18    │    3123.23ms │    3154.47ms │     no change │
   │ QQuery 19    │     511.97ms │     472.75ms │ +1.08x faster │
   │ QQuery 20    │    1042.76ms │     938.75ms │ +1.11x faster │
   │ QQuery 21    │    1567.78ms │    1611.91ms │     no change │
   │ QQuery 22    │     182.56ms │     171.52ms │ +1.06x faster │
   └──────────────┴──────────────┴──────────────┴───────────────┘
   alamb@aal-dev:~/arrow-datafusion2/benchmarks$ python compare.py results/alamb_bench/tpch_mem.json results/alamb_bench_compare/tpch_mem.json
   ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
   ┃ Query        ┃           -o ┃           -o ┃        Change ┃
   ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
   │ QQuery 1     │     876.15ms │     796.57ms │ +1.10x faster │
   │ QQuery 2     │     265.87ms │     267.19ms │     no change │
   │ QQuery 3     │     169.43ms │     164.82ms │     no change │
   │ QQuery 4     │     110.07ms │     116.02ms │  1.05x slower │
   │ QQuery 5     │     462.34ms │     449.41ms │     no change │
   │ QQuery 6     │      44.40ms │      40.54ms │ +1.10x faster │
   │ QQuery 7     │    1099.47ms │    1077.84ms │     no change │
   │ QQuery 8     │     241.97ms │     247.20ms │     no change │
   │ QQuery 9     │     584.01ms │     606.74ms │     no change │
   │ QQuery 10    │     301.95ms │     299.14ms │     no change │
   │ QQuery 11    │     239.17ms │     221.07ms │ +1.08x faster │
   │ QQuery 12    │     153.73ms │     139.95ms │ +1.10x faster │
   │ QQuery 13    │     793.76ms │     753.25ms │ +1.05x faster │
   │ QQuery 14    │      59.50ms │      49.38ms │ +1.20x faster │
   │ QQuery 15    │     103.03ms │      89.82ms │ +1.15x faster │
   │ QQuery 16    │     216.38ms │     213.85ms │     no change │
   │ QQuery 17    │    3356.99ms │    2866.85ms │ +1.17x faster │
   │ QQuery 18    │    3017.82ms │    2910.50ms │     no change │
   │ QQuery 19    │     161.19ms │     137.81ms │ +1.17x faster │
   │ QQuery 20    │     924.38ms │     855.74ms │ +1.08x faster │
   │ QQuery 21    │    1502.52ms │    1460.54ms │     no change │
   │ QQuery 22    │     133.10ms │     128.34ms │     no change │
   └──────────────┴──────────────┴──────────────┴───────────────┘
   ```
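
For readers wondering how a label such as `+1.16x faster` or `no change` relates to the two timings in each row, here is a minimal Python sketch. It is not the code of `compare.py`: the function name `describe_change` and the 5% `threshold` are assumptions, chosen only because they reproduce the labels in the rows above.

```python
def describe_change(baseline_ms: float, comparison_ms: float,
                    threshold: float = 0.05) -> str:
    """Label the change between two timings (hypothetical helper).

    The 5% threshold is an assumption that matches the tables above;
    it is not taken from compare.py itself.
    """
    if baseline_ms <= 0 or comparison_ms <= 0:
        raise ValueError("timings must be positive")

    # Express the change as a ratio >= 1 in whichever direction it went.
    if comparison_ms <= baseline_ms:
        ratio = baseline_ms / comparison_ms
        label = f"+{ratio:.2f}x faster"
    else:
        ratio = comparison_ms / baseline_ms
        label = f"{ratio:.2f}x slower"

    # Differences smaller than the threshold are treated as noise.
    if ratio - 1.0 < threshold:
        return "no change"
    return label


if __name__ == "__main__":
    # Timings copied from the tables above.
    print(describe_change(1269.30, 1097.60))  # tpch Query 1
    print(describe_change(212.83, 214.36))    # tpch Query 4
    print(describe_change(110.07, 116.02))    # tpch_mem Query 4
```

Under these assumptions, a row is only reported as faster or slower once the two timings differ by at least 5%, which is why small differences such as `tpch` Query 13 (1127.46ms vs 1087.70ms) show up as `no change`.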


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6131: Add bench.sh script to automate benchmarking DataFusion against itself

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #6131:
URL: https://github.com/apache/arrow-datafusion/pull/6131#discussion_r1180538641


##########
benchmarks/README.md:
##########
@@ -19,11 +19,31 @@
 
 # DataFusion Benchmarks
 
-This crate contains benchmarks based on popular public data sets and open source benchmark suites, making it easy to
-run real-world benchmarks to help with performance and scalability testing and for comparing performance with other Arrow
-implementations as well as other query engines.
+This crate contains benchmarks based on popular public data sets and
+open source benchmark suites, making it easy to run more realistic
+benchmarks to help with performance and scalability testing of DataFusion.
 
-## Benchmark derived from TPC-H
+# Benchmarks Against Other Engines
+
+DataFusion is included in the benchmark setups for several popular
+benchmarks that compare performance with other engines. For example:
+
+* [ClickBench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion)
+* [H2o.ai `db-benchmark`] scripts are in [db-benchmark](db-benchmark) directory
+
+[ClickBench]: https://github.com/ClickHouse/ClickBench/tree/main
+[H2o.ai `db-benchmark`]: https://github.com/h2oai/db-benchmark
+
+# Running the benchmarks
+
+## Generating Data
+
+Please use the [bench.sh] script to generate data
+
+
+# Benchmark Descriptions:
+
+## `tpch` Benchmark derived from TPC-H
 
 These benchmarks are derived from the [TPC-H][1] benchmark. And we use this repo as the source of tpch-gen and answers:

Review Comment:
   Next, I hope / plan to review the other benchmarks and consolidate them, along with their data generation and runner scripts, into the bench.sh framework.



[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #6131: Add bench.sh script to automate benchmarking DataFusion against itself

Posted by "yjshen (via GitHub)" <gi...@apache.org>.

yjshen commented on code in PR #6131:
URL: https://github.com/apache/arrow-datafusion/pull/6131#discussion_r1181154338


##########
benchmarks/README.md:
##########
@@ -19,29 +19,139 @@
 
 # DataFusion Benchmarks
 
-This crate contains benchmarks based on popular public data sets and open source benchmark suites, making it easy to
-run real-world benchmarks to help with performance and scalability testing and for comparing performance with other Arrow
-implementations as well as other query engines.
+This crate contains benchmarks based on popular public data sets and
+open source benchmark suites, making it easy to run more realistic
+benchmarks to help with performance and scalability testing of DataFusion.
 
-## Benchmark derived from TPC-H
+# Benchmarks Against Other Engines
 
-These benchmarks are derived from the [TPC-H][1] benchmark. And we use this repo as the source of tpch-gen and answers:
-https://github.com/databricks/tpch-dbgen.git, based on [2.17.1](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf) version of TPC-H.
+DataFusion is included in the benchmark setups for several popular
+benchmarks that compare performance with other engines. For example:
 
-## Generating Test Data
+* [ClickBench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion)
+* [H2o.ai `db-benchmark`] scripts are in [db-benchmark](db-benchmark) directory
 
-TPC-H data can be generated using the `tpch-gen.sh` script, which creates a Docker image containing the TPC-DS data
-generator.
+[ClickBench]: https://github.com/ClickHouse/ClickBench/tree/main
+[H2o.ai `db-benchmark`]: https://github.com/h2oai/db-benchmark
 
-```bash
-# scale_factor: scale of the database population. scale 1.0 represents ~1 GB of data
-./tpch-gen.sh <scale_factor>
+# Running the benchmarks
+
+## Running Benchmarks
+
+The easiest way to run benchmarks from DataFusion source checkouts is
+to use the [bench.sh](bench.sh) script. Usage instructions can be
+found with:
+
+```shell
+# show usage
+./bench.sh
+```
+
+## Generating Data
+
+You can create data for all these benchmarks using the [bench.sh](bench.sh) script:
+
+```shell
+./bench.sh data
+```
+
+Data is generated in the `data` subdirectory and will not be checked
+in because this directory has been added to the `.gitignore` file.
+
+
+## Example to compare performance on main to a branch
+
+```shell
+git checkout main
+
+# Create the data
+./benchmarks/bench.sh data
+
+# Gather baseline data for tpch benchmark
+./benchmarks/bench.sh run tpch
+
+# Switch to the branch the branch name is mybranch and gather data

Review Comment:
   👍 I was curious before about what the magic for comparing branches was



[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6131: Add bench.sh script to automate benchmarking DataFusion against itself

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #6131:
URL: https://github.com/apache/arrow-datafusion/pull/6131#discussion_r1181206827


##########
benchmarks/README.md:
##########
@@ -19,29 +19,139 @@
 
 # DataFusion Benchmarks
 
-This crate contains benchmarks based on popular public data sets and open source benchmark suites, making it easy to
-run real-world benchmarks to help with performance and scalability testing and for comparing performance with other Arrow
-implementations as well as other query engines.
+This crate contains benchmarks based on popular public data sets and
+open source benchmark suites, making it easy to run more realistic
+benchmarks to help with performance and scalability testing of DataFusion.
 
-## Benchmark derived from TPC-H
+# Benchmarks Against Other Engines
 
-These benchmarks are derived from the [TPC-H][1] benchmark. And we use this repo as the source of tpch-gen and answers:
-https://github.com/databricks/tpch-dbgen.git, based on [2.17.1](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf) version of TPC-H.
+DataFusion is included in the benchmark setups for several popular
+benchmarks that compare performance with other engines. For example:
 
-## Generating Test Data
+* [ClickBench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion)
+* [H2o.ai `db-benchmark`] scripts are in [db-benchmark](db-benchmark) directory
 
-TPC-H data can be generated using the `tpch-gen.sh` script, which creates a Docker image containing the TPC-DS data
-generator.
+[ClickBench]: https://github.com/ClickHouse/ClickBench/tree/main
+[H2o.ai `db-benchmark`]: https://github.com/h2oai/db-benchmark
 
-```bash
-# scale_factor: scale of the database population. scale 1.0 represents ~1 GB of data
-./tpch-gen.sh <scale_factor>
+# Running the benchmarks
+
+## Running Benchmarks
+
+The easiest way to run benchmarks from DataFusion source checkouts is
+to use the [bench.sh](bench.sh) script. Usage instructions can be
+found with:
+
+```shell
+# show usage
+./bench.sh
+```
+
+## Generating Data
+
+You can create data for all these benchmarks using the [bench.sh](bench.sh) script:
+
+```shell
+./bench.sh data
+```
+
+Data is generated in the `data` subdirectory and will not be checked
+in because this directory has been added to the `.gitignore` file.
+
+
+## Example to compare performance on main to a branch
+
+```shell
+git checkout main
+
+# Create the data
+./benchmarks/bench.sh data
+
+# Gather baseline data for tpch benchmark
+./benchmarks/bench.sh run tpch
+
+# Switch to the branch the branch name is mybranch and gather data

Review Comment:
   Thanks for the review @yjshen -- I am trying to reduce the amount of magic involved.
   
   I am going to merge this in and we can continue to iterate (next I would like to increase the number of different tests supported)



[GitHub] [arrow-datafusion] alamb merged pull request #6131: Add bench.sh script to automate benchmarking DataFusion against itself

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb merged PR #6131:
URL: https://github.com/apache/arrow-datafusion/pull/6131

