Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/07 14:39:17 UTC

[GitHub] [arrow-datafusion] isidentical opened a new pull request, #4128: Combined TPCH runs & uniformed summaries for benchmarks

isidentical opened a new pull request, #4128:
URL: https://github.com/apache/arrow-datafusion/pull/4128

   # Which issue does this PR close?
   
   
   Closes #4127.
   
   # Rationale for this change
   
   This PR adds support for executing TPCH benchmarks without a `--query`. When `--query` is omitted, all queries (1 through 22) are executed and the execution information for each of them is saved.
   
   - Summary for [`tpch --query=1`](https://gist.github.com/isidentical/93229ea1b992f1bf5f931d599ba7ec7c)
   - Summary for [`tpch`](https://gist.github.com/isidentical/9455028d7907555e506528e1c577210a)
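
   As a rough illustration of the behavior described above, here is a minimal sketch of how an optional query number might map to the set of queries to run. The constant names mirror those in the diff to `benchmarks/src/bin/tpch.rs`; the `query_range` helper is hypothetical and is not taken from the PR.

   ```rust
   use std::ops::RangeInclusive;

   // Constant names as they appear in the PR diff.
   const TPCH_QUERY_START_ID: usize = 1;
   const TPCH_QUERY_END_ID: usize = 22;

   /// Hypothetical helper: a single query when `--query` is given,
   /// otherwise the full range of TPCH queries.
   fn query_range(query: Option<usize>) -> RangeInclusive<usize> {
       match query {
           Some(id) => id..=id,
           None => TPCH_QUERY_START_ID..=TPCH_QUERY_END_ID,
       }
   }

   fn main() {
       // With `--query 1`, only query 1 is selected.
       println!("{:?}", query_range(Some(1)));
       // Without `--query`, every query in the range is selected.
       println!("{} queries", query_range(None).count());
   }
   ```

   Returning a range in both cases would let the rest of the benchmark loop treat "one query" and "all queries" the same way.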
   
   # Are there any user-facing changes?
   
   The TPCH benchmark output format is different now.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on pull request #4128: Combined TPCH runs & uniformed summaries for benchmarks

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #4128:
URL: https://github.com/apache/arrow-datafusion/pull/4128#issuecomment-1305943969

   While playing with this, I've also written a little Python script to function like a benchmark comparison UI (poor man's conbench): https://gist.github.com/isidentical/4e3fff1350e9d49672e15d54d9e8299f
   
   It is quite basic, but I think it can automate a few things for https://github.com/datafusion-contrib/benchmark-automation/tree/main. For example, here is a comparison between a run with `--disable-statistics` and one without:
   ```
    $ ./target/release/tpch benchmark datafusion --path /opt/data-parquet --format parquet --iterations 3 -o /tmp/benchmarks --disable-statistics
    $ ./target/release/tpch benchmark datafusion --path /opt/data-parquet --format parquet --iterations 3 -o /tmp/benchmarks
    $ python t.py compare /tmp/benchmarks/file1.json /tmp/benchmarks/file2.json
   ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
   ┃ Query        ┃     Baseline ┃   Comparison ┃        Change ┃
   ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
   │ Q1           │     702.18ms │     687.86ms │     no change │
   │ Q2           │     413.74ms │     302.22ms │ +1.37x faster │
   │ Q3           │     392.94ms │     395.34ms │     no change │
   │ Q4           │     111.28ms │      97.01ms │ +1.15x faster │
   │ Q5           │     465.81ms │     487.92ms │     no change │
   │ Q6           │     402.94ms │     402.48ms │     no change │
   │ Q7           │     868.18ms │     889.51ms │     no change │
   │ Q8           │     499.98ms │     468.68ms │ +1.07x faster │
   │ Q9           │     827.54ms │     837.67ms │     no change │
   │ Q10          │     503.22ms │     492.29ms │     no change │
   │ Q11          │     221.30ms │     167.37ms │ +1.32x faster │
   │ Q12          │     204.10ms │     170.99ms │ +1.19x faster │
   │ Q13          │     441.50ms │     423.67ms │     no change │
   │ Q14          │     373.42ms │     383.57ms │     no change │
   │ Q15          │     356.24ms │     352.67ms │     no change │
   │ Q16          │     115.38ms │     117.98ms │     no change │
   │ Q17          │    2099.22ms │    2209.00ms │  1.05x slower │
   │ Q18          │    1255.95ms │    1285.39ms │     no change │
   │ Q19          │     656.93ms │     660.46ms │     no change │
   │ Q20          │     640.30ms │     624.94ms │     no change │
   │ Q21          │     697.55ms │     685.22ms │     no change │
   │ Q22          │      84.20ms │      81.76ms │     no change │
   └──────────────┴──────────────┴──────────────┴───────────────┘
   ```
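
   The "Change" column above can be approximated with a small helper. This is only a sketch: the linked script is written in Python and its logic is not reproduced here, the `describe_change` function is hypothetical, and the 5% "no change" threshold is an assumption inferred from the rows in the table (for example, Q5 differs by about 4.7% and is reported as no change, while Q17 differs by about 5.2% and is reported as slower).

   ```rust
   // Assumed threshold: ratios within 5% are reported as "no change".
   // Inferred from the table above, not taken from the script itself.
   const NO_CHANGE_THRESHOLD: f64 = 1.05;

   /// Hypothetical helper: describe how `comparison_ms` relates to `baseline_ms`.
   /// Both timings are assumed to be positive.
   fn describe_change(baseline_ms: f64, comparison_ms: f64) -> String {
       if comparison_ms <= baseline_ms {
           // Comparison run took less time: express the speedup as a ratio.
           let ratio = baseline_ms / comparison_ms;
           if ratio > NO_CHANGE_THRESHOLD {
               return format!("+{:.2}x faster", ratio);
           }
       } else {
           // Comparison run took more time: express the slowdown as a ratio.
           let ratio = comparison_ms / baseline_ms;
           if ratio > NO_CHANGE_THRESHOLD {
               return format!("{:.2}x slower", ratio);
           }
       }
       "no change".to_string()
   }

   fn main() {
       // Timings for Q1, Q2 and Q17 from the table above.
       for (name, baseline, comparison) in [
           ("Q1", 702.18, 687.86),
           ("Q2", 413.74, 302.22),
           ("Q17", 2099.22, 2209.00),
       ] {
           println!("{name}: {}", describe_change(baseline, comparison));
       }
   }
   ```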




[GitHub] [arrow-datafusion] ursabot commented on pull request #4128: Combined TPCH runs & uniformed summaries for benchmarks

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #4128:
URL: https://github.com/apache/arrow-datafusion/pull/4128#issuecomment-1308027653

   Benchmark runs are scheduled for baseline = b58ec81ab06af7a267ee69b834715727dbef963d and contender = a32fb657d1caec634cb53979b2f7ef2fad224905. a32fb657d1caec634cb53979b2f7ef2fad224905 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/80754457ce3b4433828568582ede21c4...6d5afdc118814c68acfbdbd2ebba0868/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/e19e6d9e2ec64ec0b4231c057357444a...54b79daf4bc1451590bb811305f5a4d7/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/b40f41aebfe1400d8326e2bbad678f45...db65ba7b8a9144a38ccc8fc81716aa60/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/48036338773c410db48b3cca796b99e3...4465f35b1b1243c1bd0dcc18d909fa81/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4128: Combined TPCH runs & uniformed summaries for benchmarks

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4128:
URL: https://github.com/apache/arrow-datafusion/pull/4128#discussion_r1015873324


##########
benchmarks/src/bin/tpch.rs:
##########
@@ -64,7 +64,7 @@ static ALLOC: mimalloc::MiMalloc = mimalloc::MiMalloc;
 struct DataFusionBenchmarkOpt {
     /// Query number

Review Comment:
   The docstrings end up in the output of `--help` so I think it would be nice to mention what happens if this is not specified
   
   ```suggestion
       /// Query number. If not specified runs all queries
   ```



##########
benchmarks/README.md:
##########
@@ -49,6 +49,11 @@ The benchmark can then be run (assuming the data created from `dbgen` is in `./d
 cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
 ```
 
+If you omit `--query=<query_id>` argument, then all benchmarks will be run one by one (from query 1 to query 22).
+```bash
+cargo run --release --bin tpch -- benchmark datafusion --iterations 1 --path ./data --format tbl --query 1 --batch-size 4096

Review Comment:
   should this example perhaps not have `--query 1`?
   
   ```suggestion
   cargo run --release --bin tpch -- benchmark datafusion --iterations 1 --path ./data --format tbl --batch-size 4096
   ```



##########
benchmarks/src/bin/tpch.rs:
##########
@@ -182,29 +182,57 @@ async fn main() -> Result<()> {
     }
 }
 
-async fn benchmark_datafusion(opt: DataFusionBenchmarkOpt) -> Result<Vec<RecordBatch>> {
+const TPCH_QUERY_START_ID: usize = 1;
+const TPCH_QUERY_END_ID: usize = 22;
+
+async fn benchmark_datafusion(
+    opt: DataFusionBenchmarkOpt,
+) -> Result<Vec<Vec<RecordBatch>>> {
     println!("Running benchmarks with the following options: {:?}", opt);
-    let mut benchmark_run = BenchmarkRun::new(opt.query);
+    let query_range = match opt.query {

Review Comment:
   👍 
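
   The change of return type to a nested `Vec` in the hunk above suggests one set of results per query. A minimal sketch of that shape, using only the standard library, might look like the following. `QuerySummary`, `run_all`, and the `run_query` closure are all hypothetical stand-ins and not the PR's actual code; the real benchmark executes DataFusion plans and returns `RecordBatch`es.

   ```rust
   use std::ops::RangeInclusive;
   use std::time::Instant;

   /// Hypothetical per-query summary; the field names are illustrative only.
   #[derive(Debug)]
   struct QuerySummary {
       query: usize,
       /// One elapsed time (in milliseconds) per iteration.
       elapsed_ms: Vec<f64>,
   }

   /// Run every query in `queries` for `iterations` rounds, timing each round.
   /// `run_query` stands in for the real query execution and returns a row count.
   fn run_all<F>(
       queries: RangeInclusive<usize>,
       iterations: usize,
       mut run_query: F,
   ) -> Vec<QuerySummary>
   where
       F: FnMut(usize) -> usize,
   {
       let mut summaries = Vec::new();
       for query in queries {
           let mut elapsed_ms = Vec::with_capacity(iterations);
           for _ in 0..iterations {
               let start = Instant::now();
               let _rows = run_query(query);
               elapsed_ms.push(start.elapsed().as_secs_f64() * 1000.0);
           }
           summaries.push(QuerySummary { query, elapsed_ms });
       }
       summaries
   }

   fn main() {
       // A dummy workload in place of real query execution.
       let summaries = run_all(1..=22, 3, |query| query * 10);
       for s in &summaries {
           let avg = s.elapsed_ms.iter().sum::<f64>() / s.elapsed_ms.len() as f64;
           println!("Query {} avg time: {:.2} ms", s.query, avg);
       }
   }
   ```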





[GitHub] [arrow-datafusion] andygrove merged pull request #4128: Combined TPCH runs & uniformed summaries for benchmarks

Posted by GitBox <gi...@apache.org>.
andygrove merged PR #4128:
URL: https://github.com/apache/arrow-datafusion/pull/4128




[GitHub] [arrow-datafusion] isidentical commented on a diff in pull request #4128: Combined TPCH runs & uniformed summaries for benchmarks

Posted by GitBox <gi...@apache.org>.
isidentical commented on code in PR #4128:
URL: https://github.com/apache/arrow-datafusion/pull/4128#discussion_r1017211833


##########
benchmarks/README.md:
##########
@@ -49,6 +49,11 @@ The benchmark can then be run (assuming the data created from `dbgen` is in `./d
 cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
 ```
 
+If you omit `--query=<query_id>` argument, then all benchmarks will be run one by one (from query 1 to query 22).
+```bash
+cargo run --release --bin tpch -- benchmark datafusion --iterations 1 --path ./data --format tbl --query 1 --batch-size 4096

Review Comment:
   Oops, that's a nice catch!





[GitHub] [arrow-datafusion] isidentical commented on pull request #4128: Combined TPCH runs & uniformed summaries for benchmarks

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #4128:
URL: https://github.com/apache/arrow-datafusion/pull/4128#issuecomment-1307965708

   > I don't know how much you want to test the benchmark run code (it might be easier to just deal with any breakages than trying to prevent regressions through tests)
   
   Yeah, I tried to take a look at it but it seems like it would take a bit too much effort for a relatively simple feature. I guess we'll probably notice if something got broken when we have an automated benchmark system 😄 

