Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/03/12 12:22:35 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #5561: Report and compare benchmark runs against two branches

alamb opened a new issue, #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   When we make PRs like @jaylmiller's https://github.com/apache/arrow-datafusion/pull/5292 or #3463, we often want to know whether the change makes existing benchmarks faster or slower. To answer this question we would like to:
   1. Run benchmarks on `main`
   2. Run benchmarks on the PR
   3. Compare the results
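
   Step 3 can be reduced to a per-query relative change. The following is a minimal sketch, not an existing DataFusion tool: the function names and the 5% noise tolerance are assumptions chosen for illustration, and the timings are taken from the example report later in this issue.

```python
def relative_change(main_ms, branch_ms):
    """Fractional change of the PR run relative to `main` (negative = faster)."""
    return (branch_ms - main_ms) / main_ms


def verdict(main_ms, branch_ms, tolerance=0.05):
    """Label a query 'faster', 'slower', or 'no change' within a noise tolerance."""
    change = relative_change(main_ms, branch_ms)
    if change < -tolerance:
        return "faster"
    if change > tolerance:
        return "slower"
    return "no change"


# Query 1 and Query 4 from the example report (main, then branch).
print(verdict(1135.36, 1047.93))  # faster
print(verdict(146.58, 146.87))    # no change
```

   The tolerance matters because end-to-end timings vary from run to run; the right value depends on the machine and is not specified anywhere in this issue.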
   
   This workflow is well supported for the criterion-based microbenchmarks in https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/benches (by using criterion directly or using https://github.com/BurntSushi/critcmp).
   
   However, for the "end to end" benchmarks in https://github.com/apache/arrow-datafusion/tree/main/benchmarks there is no easy way I know of to do two runs and compare results. 
   
   **Describe the solution you'd like**
   There is a "machine readable" output format generated with the `-o` parameter (as shown below)
   
   1. I would like a script that compares the output of two benchmark runs, ideally written in either bash or python.
   2. Instructions on how to run the script added to https://github.com/apache/arrow-datafusion/tree/main/benchmarks
   
   So the workflow would be 
   
   ### Step 1: Create two or more output files using `-o`
   ```
   alamb@aal-dev:~/arrow-datafusion2/benchmarks$ cargo run --release --bin tpch -- benchmark datafusion --iterations 5 --path ~/tpch_data/parquet_data_SF1 --format parquet -o main
   ```
   
   This produces files like those in [benchmarks.zip](https://github.com/apache/arrow-datafusion/files/10950794/benchmarks.zip). Here is an example:
   
   
   ```json
   {
     "context": {
       "benchmark_version": "19.0.0",
       "datafusion_version": "19.0.0",
       "num_cpus": 8,
       "start_time": 1678622986,
       "arguments": [
         "benchmark",
         "datafusion",
         "--iterations",
         "5",
         "--path",
         "/home/alamb/tpch_data/parquet_data_SF1",
         "--format",
         "parquet",
         "-o",
         "main"
       ]
     },
     "queries": [
       {
         "query": 1,
         "iterations": [
           {
             "elapsed": 1555.030709,
             "row_count": 4
           },
           {
             "elapsed": 1533.61753,
             "row_count": 4
           },
           {
             "elapsed": 1551.0951309999998,
             "row_count": 4
           },
           {
             "elapsed": 1539.953467,
             "row_count": 4
           },
           {
             "elapsed": 1541.992357,
             "row_count": 4
           }
         ],
         "start_time": 1678622986
       },
       ...
   
   ```
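
   A comparison script would first need a per-query summary of each file. Here is a minimal sketch, assuming the layout shown above (`queries`, then `iterations`, then `elapsed`) and that `elapsed` is in milliseconds. The function name `query_averages` and the abbreviated sample values are made up for illustration.

```python
import json
from statistics import mean


def query_averages(results):
    """Map each query number to the mean of its per-iteration `elapsed` values."""
    return {
        q["query"]: mean(it["elapsed"] for it in q["iterations"])
        for q in results["queries"]
    }


# Abbreviated sample in the same shape as the file above (values shortened).
sample = json.loads("""
{
  "context": {"num_cpus": 8},
  "queries": [
    {"query": 1, "iterations": [
      {"elapsed": 1500.0, "row_count": 4},
      {"elapsed": 1600.0, "row_count": 4}
    ]}
  ]
}
""")

print(query_averages(sample))  # {1: 1550.0}
```

   For a real run, `sample` would be replaced by the parsed contents of each output file.
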
   ### Step 2: Compare the two files and prepare a report
   
   ```shell
   benchmarks/compare_results branch.json main.json
   ```
   
   Which would produce an output report of some type. Here is an example of such output (from @korowa on https://github.com/apache/arrow-datafusion/pull/5490#issuecomment-1459826565). Maybe they have a script they could share.
   
   
   ```
   Query               branch         main
   ----------------------------------------------
   Query 1 avg time:   1047.93 ms     1135.36 ms
   Query 2 avg time:   280.91 ms      286.69 ms
   Query 3 avg time:   323.87 ms      351.31 ms
   Query 4 avg time:   146.87 ms      146.58 ms
   Query 5 avg time:   482.85 ms      463.07 ms
   Query 6 avg time:   274.73 ms      342.29 ms
   Query 7 avg time:   750.73 ms      762.43 ms
   Query 8 avg time:   443.34 ms      426.89 ms
   Query 9 avg time:   821.48 ms      775.03 ms
   Query 10 avg time:  585.21 ms      584.16 ms
   Query 11 avg time:  247.56 ms      232.90 ms
   Query 12 avg time:  258.51 ms      231.19 ms
   Query 13 avg time:  899.16 ms      885.56 ms
   Query 14 avg time:  300.63 ms      282.56 ms
   Query 15 avg time:  346.36 ms      318.97 ms
   Query 16 avg time:  198.33 ms      184.26 ms
   Query 17 avg time:  4197.54 ms     4101.92 ms
   Query 18 avg time:  2726.41 ms     2548.96 ms
   Query 19 avg time:  566.67 ms      535.74 ms
   Query 20 avg time:  1193.82 ms     1319.49 ms
   Query 21 avg time:  1027.00 ms     1050.08 ms
   Query 22 avg time:  120.03 ms      111.32 ms
   ```
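
   A report in this layout could be rendered with a few lines of Python. This is only a sketch: `format_report` is a hypothetical helper, the column widths are copied from the example above, and the inputs are assumed to be dictionaries mapping query number to average time in milliseconds.

```python
def format_report(branch, main, labels=("branch", "main")):
    """Render two {query: avg ms} dicts as a side-by-side text table."""
    lines = [f"{'Query':<20}{labels[0]:<15}{labels[1]}", "-" * 46]
    # Only queries present in both runs can be compared.
    for query in sorted(branch.keys() & main.keys()):
        name = f"Query {query} avg time:"
        b = f"{branch[query]:.2f} ms"
        m = f"{main[query]:.2f} ms"
        lines.append(f"{name:<20}{b:<15}{m}")
    return "\n".join(lines)


print(format_report({1: 1047.93, 2: 280.91}, {1: 1135.36, 2: 286.69}))
```

   Queries that appear in only one of the two runs are silently skipped here; a real script might prefer to report them.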
   
   
   **Describe alternatives you've considered**
   Another possibility might be to move the specialized benchmark binaries into `criterion` (so they look like "microbench"es), but I think this is non-ideal because of the number of parameters supported by the benchmarks.
   
   
   **Additional context**
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1476451119

   Thanks @Taza53  -- I am testing it out now. 



[GitHub] [arrow-datafusion] alamb commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1476742125

   I tried it out and it worked great (see https://github.com/apache/arrow-datafusion/pull/5099#issuecomment-1476741489). I will prepare a PR with the script and some instructions. 



[GitHub] [arrow-datafusion] Taza53 commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "Taza53 (via GitHub)" <gi...@apache.org>.

Taza53 commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1475437624

   python compare.py path1 path2
   ![image](https://user-images.githubusercontent.com/34032665/226216928-db7cf307-88cc-4d76-8307-bec605f10bc9.png)
   https://gist.github.com/Taza53/beb42c5918d352f9b760befaa87baef9
   I have made some changes to the original. Are there any further improvements you would recommend?



[GitHub] [arrow-datafusion] jaylmiller commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "jaylmiller (via GitHub)" <gi...@apache.org>.

jaylmiller commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1477174617

   I've added `-o` functionality to all the e2e benches in #5658 so this new script should be usable with every bench.



[GitHub] [arrow-datafusion] Taza53 commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "Taza53 (via GitHub)" <gi...@apache.org>.

Taza53 commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1466520281

   > I spent some time gathering data (into [benchmarks.zip](https://github.com/apache/arrow-datafusion/files/10950794/benchmarks.zip) ) so hopefully you don't have to actually make the datasets or run the benchmarks to make this script.
   
   Thank you for gathering data, it's very helpful.
   
   > @isidentical had a script they shared here: https://gist.github.com/isidentical/4e3fff1350e9d49672e15d54d9e8299f
   
   I will take a look at it
   
   > BTW the first thing I hope/plan to do with this script is gather enough data to do #4085
   
   I am a bit unsure what this means; can you elaborate?



[GitHub] [arrow-datafusion] jaylmiller commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "jaylmiller (via GitHub)" <gi...@apache.org>.

jaylmiller commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1470495051

   I think this should support multiple benches: right now just `tpch` has the `-o` opt. 
   
   This means the other benches would be modified to have the `-o` option and to output JSON in the same structure as `tpch` (so the same script that @Taza53 is working on can be used). I can work on a PR for this if you think that would make sense, @alamb?
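
   One way to check that another benchmark's output matches the `tpch` structure is a small shape test. This is a sketch only: `matches_tpch_shape` is a hypothetical name, and the required keys (`queries`, `query`, `iterations`, `elapsed`) are taken from the `tpch` example in this issue. Whether other benches identify queries by integer is an assumption.

```python
def matches_tpch_shape(results):
    """True if `results` has the queries/iterations/elapsed layout shown for tpch."""
    try:
        return all(
            isinstance(q["query"], int)
            and all(isinstance(it["elapsed"], (int, float)) for it in q["iterations"])
            for q in results["queries"]
        )
    except (KeyError, TypeError):
        # A missing key or a wrongly typed container means the shape differs.
        return False


good = {"queries": [{"query": 1, "iterations": [{"elapsed": 1.5, "row_count": 4}]}]}
print(matches_tpch_shape(good))          # True
print(matches_tpch_shape({"runs": []}))  # False
```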



[GitHub] [arrow-datafusion] alamb commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1470601602

   > This means the other benches would be modified to have the -o option and made so that they output json in the same structure as tpch (so the same script that @Taza53 is working on can be used) . I can work on a PR for this if you think that would make sense @alamb ?
   
   I think that sounds like a great idea -- thank you 



[GitHub] [arrow-datafusion] alamb closed issue #5561: Report and compare benchmark runs against two branches

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb closed issue #5561: Report and compare benchmark runs against two branches
URL: https://github.com/apache/arrow-datafusion/issues/5561



[GitHub] [arrow-datafusion] korowa commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "korowa (via GitHub)" <gi...@apache.org>.

korowa commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1465272471

   > Here is an example of an output output (from @korowa on https://github.com/apache/arrow-datafusion/pull/5490#issuecomment-1459826565). Maybe they have a script they could share
   
   Unfortunately it was just
   ```
   cargo run --release --bin tpch -- benchmark datafusion --iterations 5 --path ./parquet --format parquet | grep "avg time"
   ```
   and the "multiline cursor" feature of the IDE 🥲
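
   The same extraction can be scripted instead of done by hand. Here is a minimal sketch, assuming the benchmark prints lines of the form `Query N avg time: X ms` (inferred from the report quoted in this issue, not checked against the benchmark source). `parse_avg_times` is a hypothetical helper.

```python
import re

# One match per "avg time" line; other output lines are ignored.
AVG_LINE = re.compile(r"Query (\d+) avg time:\s+([\d.]+) ms")


def parse_avg_times(output):
    """Extract {query number: avg ms} from captured benchmark output."""
    return {int(m.group(1)): float(m.group(2)) for m in AVG_LINE.finditer(output)}


captured = """\
Query 1 avg time: 1047.93 ms
Query 2 avg time: 280.91 ms
"""
print(parse_avg_times(captured))  # {1: 1047.93, 2: 280.91}
```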



[GitHub] [arrow-datafusion] jaylmiller commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "jaylmiller (via GitHub)" <gi...@apache.org>.

jaylmiller commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1476287791

   That output looks awesome! 🚀



[GitHub] [arrow-datafusion] alamb commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1466523715

   > > BTW the first thing I hope/plan to do with this script is gather enough data to do https://github.com/apache/arrow-datafusion/issues/4085
   
   > I am a bit unsure, can you elaborate on this.
   
   Yes -- sorry -- all I was trying to say is that I am excited to use the script and will likely try it on a "real" use case as soon as you have it available (basically to test https://github.com/apache/arrow-datafusion/issues/4085)



[GitHub] [arrow-datafusion] alamb commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1476751473

   Created https://github.com/apache/arrow-datafusion/pull/5655 -- thanks @Taza53  ❤️ 



[GitHub] [arrow-datafusion] Taza53 commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "Taza53 (via GitHub)" <gi...@apache.org>.

Taza53 commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1465330603

   Hi, I would like to take a crack at it. I will try to do it in python.



[GitHub] [arrow-datafusion] jaylmiller commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "jaylmiller (via GitHub)" <gi...@apache.org>.

jaylmiller commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1465243239

   I could take this one up in the next few days if a new contributor does not end up picking this up as their first issue.



[GitHub] [arrow-datafusion] alamb commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1465865749

   That would be great @Taza53  -- thank you.
   
   I spent some time gathering data (into  [benchmarks.zip](https://github.com/apache/arrow-datafusion/files/10950794/benchmarks.zip) ) so hopefully you don't have to actually make the datasets or run the benchmarks to make this script.
   
   



[GitHub] [arrow-datafusion] alamb commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1465863272

   Thanks!



[GitHub] [arrow-datafusion] alamb commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1465936791

   BTW the first thing I hope/plan to do with this script is gather enough data to do https://github.com/apache/arrow-datafusion/issues/4085



[GitHub] [arrow-datafusion] alamb commented on issue #5561: Report and compare benchmark runs against two branches

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5561:
URL: https://github.com/apache/arrow-datafusion/issues/5561#issuecomment-1466000822

   @isidentical  had a script they shared here: https://gist.github.com/isidentical/4e3fff1350e9d49672e15d54d9e8299f

