Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/03/07 15:16:31 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #5504: Run DataFusion benchmarks regularly and track history

alamb opened a new issue, #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   As we make changes to DataFusion, some changes impact performance and some do not. Right now we mostly rely on reviewers to judge when a change could affect performance and, if so, to run the appropriate benchmarks.
   
   This means that 
   1. We may miss some performance regressions (such as https://github.com/apache/arrow-datafusion/issues/5325)
   2. Since the benchmarks are not run regularly, it is hard to know how to interpret their results, and some seem to have bitrotted over time
   3. The wide variety of available benchmarks (e.g. https://github.com/apache/arrow-datafusion/issues/5502) makes it hard to know which ones to run and how to determine if performance has improved or regressed for particular changes
   
   **Describe the solution you'd like**
   I would like
   1. A system that runs DataFusion benchmarks regularly on main
   2. Some automated way to see if a particular PR has improved or regressed performance
   3. Bonus: a webpage that shows performance over time. Databend has a great example: https://perf.databend.rs/
   
   **Suggestion**
   I believe conbench (https://conbench.ursa.dev/), which is already partially integrated into the repo, is intended for exactly this use case. Using conbench would be nice as it appears to be actively maintained, has resources, and is already hosted.
   
   The integration is https://github.com/apache/arrow-datafusion/tree/main/conbench and was added in https://github.com/apache/arrow-datafusion/pull/1791 by @dianaclarke
   
   You can see the integration in action: it posts comments on PRs after merge, such as https://github.com/apache/arrow-datafusion/pull/5476#issuecomment-1456748444
   
   
   **Describe alternatives you've considered**
   We could also figure out a new system.
   
   




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "Smurphy000 (via GitHub)" <gi...@apache.org>.
Smurphy000 commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1763127773

   I am interested in spending some time helping out with benchmarking. I am currently checking out conbench, but if another solution is needed I can look into that as well.
   
   I'm interested to know how many permutations we need to track continually (e.g. machine size, SIMD or not, etc.).
   
   Looking for any additional thoughts / guidance beyond what I have read so far in this issue.




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "gruuya (via GitHub)" <gi...@apache.org>.
gruuya commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-2021563143

   Hey @korowa, thanks for your observations! Indeed, my own experience suggests a slight bias against the PR results for some reason: typically a couple of queries are reported with a 1.05-1.3x slowdown (even if nothing has changed), so I'd ignore anything in that range for now (not sure it's worth increasing the 5% threshold though).
   
   That said, I think the current setup is sufficient to catch larger regressions for now. I also don't think increasing the number of iterations / sleeping between them would on its own be good enough, since we'd trade that against the increased longitudinal performance variance of the shared GitHub runners, but I guess it might be worth a try.
   
   The more long-term solution, and the first next step that makes sense to me, would be to run these on a dedicated runner.
   
   Following that, I see the list of improvements as:
   - (optional) simplify CI benchmark development: https://github.com/apache/arrow-datafusion/issues/9638
   - add more benchmarks (a selection of ClickBench + TPC-DS queries)
   - track benchmarks over time, through a similar job triggered by merge commits to main (fwiw, I now prefer Bencher to conbench, as it seems simpler to set up and maintain)
   - (optional) re-base the benchmarks on criterion.rs, which provides a neat, standardized way of running/analyzing/collecting these stats and has recently become a bit more user-friendly: https://www.tweag.io/blog/2022-03-03-criterion-rs/




[GitHub] [arrow-datafusion] alamb commented on issue #5504: Run DataFusion benchmarks regularly and track history

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1519954797

   My high level plan is something like:
   - [ ] https://github.com/apache/arrow-datafusion/issues/6107
   - [ ] Write a script to run the benchmark and generate line protocol with a specific git repo / sha
   - [ ] Write up instructions on how to visualize that data (maybe locally with grafana)




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "gruuya (via GitHub)" <gi...@apache.org>.
gruuya commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1984174582

   Ok, so I went ahead and took a stab at 1.: https://github.com/apache/arrow-datafusion/pull/9461
   
   Note that there are some issues with properly testing this out, as noted in the description (though I could be missing some GitHub Actions trick), so the effort will require a follow-up PR.
   
   The idea is to just run the usual benchmarks that devs run locally, and then post a comment with the results to the PR, so nothing fancy.
   
   I can also post some ideas I had about potential follow-up work later on, if this makes sense.




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "gruuya (via GitHub)" <gi...@apache.org>.
gruuya commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-2022700518

   > would you be willing to set it up?
   
   Yup, I can do that.
   
   > Another potential thing to do is to increase the data size / time required per query.
   
   Yeah, I think something like adding TPC-H SF 10 would help somewhat.




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1975088196

   > IMO it's best to try and tackle 1. for now, as this would present a big win, and would pave the way to 2 as well.
   
   I agree this is a good idea / approach.
   
   Note I tried to figure out conbench in the past and I didn't get very far -- though I didn't spend a huge amount of focused time on it. 
   
   




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "epompeii (via GitHub)" <gi...@apache.org>.
epompeii commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1974860217

   @gruuya I would be more than happy to help with the integration of Bencher, which could actually help with both 1. and 2.
   I set up a similar approach to 1. for Diesel recently: https://github.com/diesel-rs/diesel/pull/3849
   It used relative benchmarking, which looks something like this: https://bencher.dev/docs/how-to/track-benchmarks/#relative-benchmarking
   So you only compare the base commit and the current/merge commit on the same runner, like you want.
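   
   A minimal sketch of that flow, assuming the repo's bench.sh runner (the git handling and suite name are illustrative only):
   
   ```python
   # Relative benchmarking sketch: run the same suite for the base commit and
   # the PR head on one machine, so runner noise affects both sides equally.
   import subprocess
   
   def git(*args: str) -> str:
       out = subprocess.run(["git", *args], check=True, capture_output=True, text=True)
       return out.stdout.strip()
   
   head = git("rev-parse", "HEAD")                # the PR commit under test
   base = git("merge-base", "origin/main", head)  # where it forked from main
   
   for sha in (base, head):
       git("checkout", sha)
       subprocess.run(["./benchmarks/bench.sh", "run", "tpch"], check=True)
   # bench.sh leaves per-commit results under ./benchmarks/results/ to compare.
   ```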
   
   For 2. we would just benchmark on pushes to `main` and you would be able to visualize the results with a Bencher Perf Page; [this is an example for the Rustls project](https://bencher.dev/perf/rustls-821705769).




[GitHub] [arrow-datafusion] alamb commented on issue #5504: Run DataFusion benchmarks regularly and track history

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1518262343

   > How are you considering running the benchmarks in CI, @andygrove?
   
   Using CI in general would be ideal.
   
   Using the GitHub runners is probably not a great idea as they are quite variable (they run on shared machines and there isn't any visibility into what else is going on on those machines / VMs)




[GitHub] [arrow-datafusion] andygrove commented on issue #5504: Run DataFusion benchmarks regularly and track history

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1458380935

   I had started working on this, not just for DataFusion, but for other OSS query engines as well. 
   
   See https://sqlbenchmarks.io/sqlbench-h/results/env/workstation/sf10/single_node/ for an example.
   
   I was planning on automating this to run nightly, or on each merge to master, but have not found the time to do this yet.
   
   
   
   
   




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "epompeii (via GitHub)" <gi...@apache.org>.
epompeii commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1764526813

   @alamb that totally makes sense. I'm the maintainer of Bencher, so if you or @Smurphy000 would like a walkthrough/[to hop on a call](https://calendly.com/epompeii/call), that may be a pretty quick way to map concepts. Long term, I'm hoping that the docs get good enough to make this mapping super easy and intuitive. That's still a work in progress though 😃
   
   To answer your specific example about "num cpus" though: that would be metadata about the Testbed (https://bencher.dev/docs/explanation/benchmarking/#testbed). Currently the best way to have that available in Bencher itself is to include it in the name. You can then compare benchmark results by Testbed (e.g. "Debian x86 4 CPU 8GB" vs "Debian x86 8 CPU 16GB").
   
   And definitely explore all your options @Smurphy000! If you haven't seen it already, I've compiled a pretty comprehensive list of prior art in the space that may be helpful here: https://bencher.dev/docs/reference/prior-art/
   Let me know if you find anything that you think I should add!




[GitHub] [arrow-datafusion] epompeii commented on issue #5504: Run DataFusion benchmarks regularly and track history

Posted by "epompeii (via GitHub)" <gi...@apache.org>.
epompeii commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1511304087

   How are you considering running the benchmarks in CI, @andygrove?
   I've been working on a tool similar to `conbench` called Bencher (https://github.com/bencherdev/bencher), and I've been trying to explore the best ways to do continuous benchmarking.




[GitHub] [arrow-datafusion] alamb commented on issue #5504: Run DataFusion benchmarks regularly and track history

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1733363121

   It seems as if conbench is now more actively maintained than it once was and claims to have Rust support: https://github.com/conbench/conbench
   
   Perhaps it is worth another look




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1986378984

   Thanks @gruuya -- I'll check out https://github.com/apache/arrow-datafusion/pull/9461 shortly. I am back and catching up on alerts.




[GitHub] [arrow-datafusion] alamb commented on issue #5504: Run DataFusion benchmarks regularly and track history

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1721306824

   So I can find the compute resources to run this, and I think we can use existing timeseries databases to track performance over time (e.g. IOx or grafana). The only missing piece is getting the data into a format that can be loaded into these systems.
   
   I think we could get a simple version of this feature going as follows:
   1. Invoke [bench.sh](https://github.com/apache/arrow-datafusion/blob/main/benchmarks/bench.sh) for the benchmarks of interest
   2. Use https://github.com/apache/arrow-datafusion/issues/6107 to convert the existing bench output to line protocol (I wrote up more details of this transformation)
   3. Check in the line protocol somewhere
   4. Upload the line protocol to influxdb or some other system that supports it to make some simple visualizations (a sketch of this step follows below)
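   
   For illustration only, step 4 could be a few lines against an InfluxDB v2-style write API; the host, org, bucket, and token below are placeholders, not a real deployment:
   
   ```python
   # Upload accumulated line protocol to an InfluxDB v2-style write endpoint.
   import requests
   
   with open("benchmarks/results/history.lp", "rb") as f:
       body = f.read()
   
   resp = requests.post(
       "http://localhost:8086/api/v2/write",  # placeholder host
       params={"org": "datafusion", "bucket": "bench", "precision": "ns"},
       headers={"Authorization": "Token <api-token>"},
       data=body,
   )
   resp.raise_for_status()  # the write API returns 204 No Content on success
   ```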
   
   
   
   
   




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1969025707

   @gruuya  and I were talking about this project recently and he said he may have some ideas on how to proceed / push it forward. 




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-2022602956

   > The more long-term solution, and the first next step that makes sense to me, would be to run these on a dedicated runner.
   
   I agree with this. How can we make it happen? If we found a monetary sponsor to host the runner, would you be willing to set it up?
   
   Another potential thing to do is to increase the data size / time required per query. As @korowa notes, when the queries take only a few tens of milliseconds to run, the variability related to the runners is a significant portion of overall query time. I think this will still be a problem with a dedicated runner.
   
   This would of course increase the time required to run benchmarks.
   




[GitHub] [arrow-datafusion] epompeii commented on issue #5504: Run DataFusion benchmarks regularly and track history

Posted by "epompeii (via GitHub)" <gi...@apache.org>.
epompeii commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1521804491

   @alamb this is effectively what I have built with Bencher. It tracks benchmarks over time, allows you to visualize the results (and share them as an auto-updating README image), and uses statistical thresholds to generate alerts when performance changes. There's a CLI and a GitHub Action to make it easier to run both locally and in CI.
   
   > Using CI in general would be ideal.
   
   Definitely! As for compute, I've heard folks mention https://buildjet.com as an option. In the long run, I'm looking to add AWS bare metal runners to Bencher to help with this end of things. So please let me know if you are interested in going in that direction.




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "Smurphy000 (via GitHub)" <gi...@apache.org>.
Smurphy000 commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1764371318

   I am currently looking into the few options we have (conbench, Bencher, custom) and I can update here with my findings so we can agree on the best option. I am starting with @alamb's idea, selfishly, to learn a few new things, but I definitely want to take a look at existing frameworks!




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "epompeii (via GitHub)" <gi...@apache.org>.
epompeii commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1763385917

   @alamb the port of InfluxDB over to Rust is super cool. Congrats!
   I'm considering using it long term, if/when Bencher needs a supplemental backend for results storage.
   
   Before @Smurphy000 goes reinventing the wheel here, though, I just wanted to point out that Bencher handles all of this out of the box. All you would need to do is add a flag to your existing runner to output JSON in the expected format, and you're set: https://bencher.dev/docs/explanation/adapters#-json
   Bencher would then handle Step 1 (example of [continuous benchmark dogfooding with Bencher](https://bencher.dev/perf/bencher?key=true&metric_kind=latency&tab=benchmarks&testbeds=0d991aac-b241-493a-8b0f-8d41419455d2&branches=619d15ed-0fbd-4ccb-86cb-fddf3124da29&benchmarks=3525f177-fc8f-4a92-bd2f-dda7c4e15699%2C1db23e93-f909-40aa-bf42-838cc7ae05f5&start_time=1674950400000&reports_per_page=4&reports_page=1&branches_per_page=8&branches_page=1&testbeds_per_page=8&testbeds_page=1&benchmarks_per_page=8&benchmarks_page=1))
   And the Bencher CLI makes it trivial to handle Step 2: `bencher run "./benchmarks/bench.sh data"`
   Then you could also detect performance regressions in CI and even have it comment on PRs: https://bencher.dev/docs/explanation/thresholds/
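   
   For example, a hypothetical adapter producing Bencher Metric Format (BMF) JSON from a bench.sh-style result file might look like the following; the input layout is an assumption, and only the output shape follows the BMF docs linked above:
   
   ```python
   # Hypothetical bench-results -> Bencher Metric Format (BMF) adapter.
   # The input fields ("queries", "query", "elapsed_ms") are guesses about the
   # local benchmark output; BMF's built-in latency measure is in nanoseconds.
   import json
   import sys
   
   results = json.load(open(sys.argv[1]))
   bmf = {
       f"tpch/q{run['query']}": {"latency": {"value": run["elapsed_ms"] * 1e6}}
       for run in results["queries"]
   }
   json.dump(bmf, sys.stdout, indent=2)
   ```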
   
   As far as the orchestration of jobs from GitHub -> dedicated VM goes, this is something I am actively exploring making easier to do. I'm looking at creating an extension to Bencher for creating GitHub Actions self-hosted runners: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners
   I'm also open to other ideas if you all have any!




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "epompeii (via GitHub)" <gi...@apache.org>.
epompeii commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-2027604140

   > track benchmarks over time, through a similar job triggered by merge commits to main (fwiw, I now prefer Bencher to conbench, as it seems simpler to set up and maintain)
   
   @gruuya let me know if you run into anything or have any questions getting set up with Bencher.
   I would be more than happy to answer any questions or to help with parts of the integration work here.
   
   > (optional) re-base benchmarks on criterion.rs
   
   If you do move to Criterion, Bencher has a [built-in adapter for Criterion](https://bencher.dev/docs/explanation/adapters/#-rust-criterion) which should make things pretty simple.
   
   > The more long-term solution, and the first next step that makes sense to me, would be to run these on a dedicated runner.
   
   @alamb if you all want to build things out yourselves, the [Rustls bench runner](https://github.com/rustls/rustls-bench-app) may be a good starting point. I recently wrote [a case study of the Rustls continuous benchmarking setup](https://bencher.dev/learn/case-study/rustls/) if that is of interest.
   
   Another possibility is Bencher Cloud Bare Metal, which I am working on: a fleet of identical hardware where benchmarks are run on the bare metal servers. I can go into more detail if that is something you all want to explore.




[GitHub] [arrow-datafusion] ozankabak commented on issue #5504: Run DataFusion benchmarks regularly and track history

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1461276727

   > I was planning on automating this to run nightly, or on each merge to master, but have not found the time to do this yet.
   
   This would be fantastic!




[GitHub] [arrow-datafusion] alamb commented on issue #5504: Run DataFusion benchmarks regularly and track history

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1519940707

   I was thinking about this the other day -- and it seems to me this might be a perfect use case for a timeseries database (which, full disclosure, we have at InfluxData). I was thinking that one way to record the history over time is to use a widely supported format like line protocol -- https://docs.influxdata.com/influxdb/cloud-iox/reference/syntax/line-protocol/
   
   Then we can visualize and display that data over time using existing tools like grafana. Also, the "alert if things get slower" requirement sounds a lot like the kinds of alerts used in timeseries databases.
   
   I'll try to whip up some way to convert benchmark results into line protocol over the next few weeks
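   
   Illustratively, a single query run might be encoded as one line protocol point like so (every tag/field name here is invented; the real schema would need to be designed):
   
   ```python
   # Line protocol shape: "measurement,tag_set field_set timestamp", one point
   # per query run. All names below are invented for illustration.
   def to_line_protocol(query: str, sha: str, duration_ms: float, ts_ns: int) -> str:
       return f"tpch,query={query},sha={sha} duration_ms={duration_ms} {ts_ns}"
   
   print(to_line_protocol("q1", "4819e7a", 1234.5, 1682000000000000000))
   # tpch,query=q1,sha=4819e7a duration_ms=1234.5 1682000000000000000
   ```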




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1764299997

   @epompeii -- I would love it if someone could explore the possibility -- I haven't had the time to look into what bencher supports. There are many different continuous benchmarking frameworks / services, and the reason I think people keep reinventing the wheel is that the mental effort to integrate with the frameworks (e.g. understanding the bencher JSON format, understanding what a slug is, or how to map important metadata like "num cpus" to this model) is often larger than creating something custom (or at least appears so from the outside).
   
   In my opinion this is why we don't use conbench -- no one has invested enough time to meld that model with the DataFusion numbers.
   
   I agree it is infuriatingly repetitive. However, unless you (or someone else) directly integrates DataFusion with one of these frameworks in a way that works, I am going to expend my effort on what is easiest for me (aka tools I already know), unfortunately.




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1763339527

   Hi @Smurphy000, that would be amazing 🙏. This is one of the issues I think is critical to the long-term success of DataFusion but that has been hard to attract attention for.
   
   The key, in my mind, is to minimize the complexity and infrastructure requirements of this solution, as DataFusion doesn't (yet) have the kind of resources to keep a custom system operating.
   
   # Step 1: transform benchmark data for graphing
   
   I first recommend checking out https://github.com/apache/arrow-datafusion/issues/6107 and seeing if you can write a python script / rust program that takes the JSON output of a benchmark run and produces a single line for each query run, with the relevant parameters. That issue has example data and the desired output. You might also have to extend the rust benchmark runner. A conversion could look roughly like the sketch below.
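   
   A sketch of what such a converter could look like (the input field names are guesses, since the real example data lives in that issue):
   
   ```python
   # Hypothetical JSON -> line protocol converter: one output line per query.
   # Field names ("queries", "elapsed_ms", ...) are guesses at the bench output.
   import json
   import sys
   
   def main(path: str) -> None:
       results = json.load(open(path))
       sha = results["context"]["sha"]  # assumed run metadata
       for run in results["queries"]:
           print(
               f"tpch,query=q{run['query']},sha={sha} "
               f"duration_ms={run['elapsed_ms']} {run['start_unix_ns']}"
           )
   
   if __name__ == "__main__":
       main(sys.argv[1])
   ```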
   
   In terms of implementation, I suggest starting with one setup only (InfluxData can supply a machine / VM initially -- likely an 8 core, 32GB of RAM machine) and then we can expand the tested combinations as our needs grow.
   
   # Step 2: script to gather data for each commit
   
   So in my mind, the ideal solution looks like:
   1. A runner script that runs the benchmarks, and appends the results to some sort of text file (ideally https://github.com/apache/arrow-datafusion/issues/6107) that we can check in and that is easy to visualize
   2. Written in one of the existing languages used in the DataFusion repo: python, bash, or rust
   
   
   If you fancy a bit of bash scripting, maybe you could start with [bench.sh](https://github.com/apache/arrow-datafusion/blob/main/benchmarks/bench.sh) and extend it to check out the desired SHAs (I could do this part too, if you were able to do https://github.com/apache/arrow-datafusion/issues/6107)
   
   
   Here is one way a testing session might look:
   ```
   # setup, fetch all needed data files
   ./benchmarks/bench.sh data 
   
   # Run the tpch benchmark on commit 4819e7a,
   # leaving results in ./benchmarks/results/4819e7a.
   # This would add the appropriate lines to ./benchmarks/results/history.lp
   ./benchmarks/bench.sh run tpch --commit 4819e7a 
   ```
   
   Then we could write up instructions on how to visualize the data in history.lp (I would probably use influxdb/grafana as that is what I know)
   
   Does that make sense?
   
   
   




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "gruuya (via GitHub)" <gi...@apache.org>.
gruuya commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1973862429

   Ok, I think I'm up to speed with the history of this issue now. 
   
   To me it seems that there are 2 separate action items discussed here:
   1. A CI benchmark that will comment on PRs once they're merged to main, and possibly on open PRs after invoking some hook (points 1 and 2 in the issue description). This would be just a summary of the perf comparison between the base commit and the current/merge commit. It might be done using GitHub shared runners, as presumably the variance wouldn't be too wild within a single run.
   2. As an extension of 1. above, a way to persist this info, and ideally visualize it over time. This would involve a bit more effort, as it would require setting up (and maintaining) dedicated infra to control for the variance. I guess the idea was to check the [benchmark numbers](https://github.com/apache/arrow-datafusion/issues/6107) into some (this?) GitHub repo, as opposed to a cloud DB somewhere?
   
   It also looks like both [conbench](https://github.com/apache/arrow/pull/39592#issuecomment-1969889622) and [bencher](https://github.com/bencherdev/bencher?tab=readme-ov-file#example-pr-comment) are capable of handling the above.
   
   IMO it's best to try and tackle 1. for now, as this would present a big win, and would pave the way to 2 as well.




Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

Posted by "korowa (via GitHub)" <gi...@apache.org>.
korowa commented on issue #5504:
URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-2021270695

   Just some thoughts based on #9800 -- at this moment benchmark results may show significant fluctuations (which is expected), but on the other hand, the example from #9461 (https://github.com/splitgraph/arrow-datafusion/pull/1) shows that it is fine to use them for tracking significant performance regressions.
   
   Maybe we should increase the 5% "no change" threshold for these benchmarks? It won't be able to show the majority of improvements, but it will still be able to highlight changes that are literally breaking in terms of performance.
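   
   A minimal sketch of the comparison logic in question, with the "no change" band as a parameter (all numbers invented):
   
   ```python
   # Classify a PR-vs-base timing ratio against a configurable "no change" band.
   # 1.05 is the current 5% threshold discussed above.
   NO_CHANGE = 1.05
   
   def classify(base_ms: float, pr_ms: float, threshold: float = NO_CHANGE) -> str:
       ratio = pr_ms / base_ms
       if ratio >= threshold:
           return f"slower (x{ratio:.2f})"
       if ratio <= 1.0 / threshold:
           return f"faster (x{1.0 / ratio:.2f})"
       return "no change"
   
   print(classify(100.0, 131.0))  # -> slower (x1.31)
   ```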

