Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/04/26 15:34:23 UTC

[GitHub] [arrow-rs] alamb opened a new issue, #4141: Easy DataFusion vs DataFusion benchmarking

alamb opened a new issue, #4141:
URL: https://github.com/apache/arrow-rs/issues/4141

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   I want an easy way to run and compare the performance on branches for various database benchmarks. For example, I want a single command to run and get a report that tells me  "does this PR make DataFusion faster or slower". This most recently came up as part of https://github.com/apache/arrow-datafusion/pull/6034
   
   DataFusion has [several benchmark runners](https://github.com/apache/arrow-datafusion/tree/main/benchmarks) but they have grown "organically" and are hard to use: they require manually downloading datasets and are not very easy to run or reproduce (see discussions on https://github.com/apache/arrow-datafusion/pull/6034#issuecomment-1521511462)
   
   Right now, it is cumbersome to do so -- I need to know how to create the appropriate datasets, build the runners, convert the dataset to parquet (potentially), run the benchmarks, and then build a report. 
   
   This is made more challenging by the fact that the runners need to be built in release mode which is slow (takes several minutes per cycle)
   
   
   **Describe the solution you'd like**
   I want a documented methodology (ideally in a script) that will do:
   
   1. Setup: creates / downloads / whatever the data files needed
   2. Run `<name>` `<optional arguments to restrict what benchmarks are run>`: runs the benchmarks and writes timing information into log files
   3. Compare: writes out a report comparing the runs
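
As a rough illustration of the "Compare" step, here is a minimal Python sketch. It is not an existing DataFusion tool. It assumes a hypothetical log format in which each run is a JSON file mapping a query name to a list of elapsed times in milliseconds; the function names, the 5% threshold, and the column labels are likewise assumptions made for the example.

```python
import json
import statistics
from pathlib import Path


def load_timings(path):
    """Load {query_name: [elapsed_ms, ...]} from a JSON file.

    The file layout is a hypothetical format chosen for this sketch.
    """
    return json.loads(Path(path).read_text())


def compare(baseline, candidate, threshold=0.05):
    """Return rows of (query, baseline_ms, candidate_ms, ratio, verdict).

    Each query is summarized by the median of its iterations. Only queries
    present in both runs are compared. A relative change smaller than
    `threshold` (an assumed default of 5%) is reported as "no change".
    Baseline medians are assumed to be non-zero.
    """
    rows = []
    for query in sorted(baseline.keys() & candidate.keys()):
        base = statistics.median(baseline[query])
        cand = statistics.median(candidate[query])
        ratio = cand / base
        if ratio > 1 + threshold:
            verdict = "slower"
        elif ratio < 1 - threshold:
            verdict = "faster"
        else:
            verdict = "no change"
        rows.append((query, base, cand, ratio, verdict))
    return rows


def format_report(rows):
    """Render the comparison rows as a plain-text table."""
    lines = [f"{'query':<10}{'main (ms)':>12}{'branch (ms)':>14}{'ratio':>8}  verdict"]
    for query, base, cand, ratio, verdict in rows:
        lines.append(f"{query:<10}{base:>12.1f}{cand:>14.1f}{ratio:>8.2f}  {verdict}")
    return "\n".join(lines)
```

With two such files on disk, something like `print(format_report(compare(load_timings("main.json"), load_timings("branch.json"))))` would print one line per query. Using the median rather than the mean is a design choice for the sketch, intended to reduce the influence of a single slow iteration.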
   
   
   Performance is a key differentiator for DataFusion. We often see third party benchmarks comparing performance to other systems - e.g .YYYYY
   
   However, in addition to comparing to different systems, we also need to compare the performance of DataFusion over time. Initially I need an easier way to compare DataFusion performance with a proposed change
   
   We currently have the tpch benchmark (links) and I have a janky script that can compare performance with the main branch: https://github.com/alamb/datafusion-benchmarking/blob/1f0beb5d32c39b6cc576e9846cddc40e692d181f/bench.sh
   
   However, TPCH is a fairly small set of queries and may not cover all the interesting use cases (e.g. many of its queries have non-trivial joins)
   
   
   
   I would like to extend the built in benchmark runners to include:
   * Add clickbench benchmark runner in datafusion
   * Add h2oai style benchmark runner in datafusion: https://duckdb.org/2023/04/14/h2oai.html
   
   I propose renaming tpch to `runner` (and keep an alias for tpch)
   
   The runner should do:
   
   
   **Describe alternatives you've considered**
   <!--
   A clear and concise description of any alternative solutions or features you've considered.
   -->
   
   **Additional context**
   This will likely result in cleaning up the runners in https://github.com/apache/arrow-datafusion/issues/5502
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb closed issue #4141: Easy DataFusion vs DataFusion benchmarking

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #4141: Easy DataFusion vs DataFusion benchmarking
URL: https://github.com/apache/arrow-rs/issues/4141




[GitHub] [arrow-rs] alamb commented on issue #4141: Easy DataFusion vs DataFusion benchmarking

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4141:
URL: https://github.com/apache/arrow-rs/issues/4141#issuecomment-1523630509

   wrong repo -- sorry 🤦 

