Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/22 04:45:10 UTC

[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1047420945


   Cross-post from Slack:
   
   I'm working on updating DataFusion's db-benchmark results based on DataFusion v7.  I just got a first cut of the results to compare with what I produced a couple of months ago.  I was planning on finalizing the analysis before sharing, but I wanted to provide a preview as I may not have time to finish for a day or two.  These results were produced using datafusion-python on an M1 MacBook.
   
   On December 27th we were at the below for group by:
   ```
   0.11225258399999993 # q1
   0.695109333 # q2
   2.932470125 # q3
   0.07341450000000016 # q4
   3.3075385419999996 # q5
   2.9051008750000005 # q7
   4.573697916 # q8
   68.875322208 # q10
   ```
   
   Based on DataFusion version 7:
   ```
   q1: 0.03743266599999995
   q2: 0.4997687500000001
   q3: 2.119365208
   q4: 0.034825500000000176
   q5: 2.144292417
   q7: 2.0165450419999997
   q8: 2.9783209999999993
   q10: 47.229685542
   ```
   
   We've seen pretty good performance increases across the board with the latest release.  Compared to the currently published db-benchmark results, that would put DataFusion as the fastest, or tied for fastest, on groupby queries Q1 and Q4.  In general, we had similar results to Spark.
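
To put a number on "across the board", here is a minimal sketch that computes the per-query group by speedup from the two sets of timings quoted above. It assumes both sets are wall-clock seconds (the post does not state the unit); the variable and function names are illustrative only.

```python
# Group by timings copied from the comment above.
# Assumption: both runs are in seconds; the post does not state the unit.
december = {
    "q1": 0.11225258399999993, "q2": 0.695109333,
    "q3": 2.932470125, "q4": 0.07341450000000016,
    "q5": 3.3075385419999996, "q7": 2.9051008750000005,
    "q8": 4.573697916, "q10": 68.875322208,
}
v7 = {
    "q1": 0.03743266599999995, "q2": 0.4997687500000001,
    "q3": 2.119365208, "q4": 0.034825500000000176,
    "q5": 2.144292417, "q7": 2.0165450419999997,
    "q8": 2.9783209999999993, "q10": 47.229685542,
}


def speedups(before, after):
    """Return before/after ratios per query; values above 1 mean 'after' is faster."""
    return {q: before[q] / after[q] for q in before}


groupby_speedup = speedups(december, v7)
for q, ratio in groupby_speedup.items():
    print(f"{q}: {ratio:.2f}x")
```

On these numbers every query comes out faster, from roughly 1.4x on the larger aggregations to about 3x on q1.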
   
   For join, in December we had:
   ```
   q1 took 261 ms
   q2 took 367 ms
   q3 took 334 ms
   q4 took 507 ms
   q5 took 1936 ms
   ```
   
   and now we are at (these look to be in seconds, whereas the December numbers above are in ms):
   ```
   q1: 0.5796001249999999
   q2: 0.4178434580000001
   q3: 0.4701954159999999
   q4: 0.4357888750000001
   q5: 1.8161980410000003
   ```
   We have lost some performance on the join side and I'm not sure why, but compared to other engines we are still doing very well, with basically the best performance across the board.
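
The two join listings use different formats, so a like-for-like comparison needs a unit conversion first. The sketch below assumes the December figures are milliseconds (as labelled) and the v7 figures are seconds (an inference from their magnitude, not something the post states); names are illustrative.

```python
# Join timings copied from the comment above.
december_ms = {"q1": 261, "q2": 367, "q3": 334, "q4": 507, "q5": 1936}
# Assumption: the v7 numbers are seconds, so they are converted to ms below.
v7_s = {
    "q1": 0.5796001249999999, "q2": 0.4178434580000001,
    "q3": 0.4701954159999999, "q4": 0.4357888750000001,
    "q5": 1.8161980410000003,
}

# Percentage change relative to December; positive means v7 is slower.
change_pct = {
    q: (v7_s[q] * 1000 - december_ms[q]) / december_ms[q] * 100
    for q in december_ms
}
slower = sorted(q for q, pct in change_pct.items() if pct > 0)

for q, pct in change_pct.items():
    print(f"{q}: {pct:+.1f}%")
print("slower in v7:", slower)
```

Under that unit assumption the regression is concentrated in q1 to q3, while q4 and q5 are slightly faster than in December.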
   
   Please take these results as preliminary; I'm still working through things.
   
   I'm going to work on adding the missing group by queries now with the latest v7 functionality.  I was also thinking of contributing a script that would run the whole db-benchmark process so that anyone could run db-benchmark as needed.
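
As a rough idea of what the timing part of such a script could look like, here is a minimal sketch of a harness that times a set of named queries. It is not the db-benchmark tooling and does not call datafusion-python: `time_queries` and the placeholder workloads are hypothetical, and in a real run each callable would execute one benchmark query and collect its result.

```python
import time


def time_queries(queries, repeats=2):
    """Run each callable `repeats` times; keep the best wall-clock time in seconds."""
    results = {}
    for name, run_query in queries.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run_query()
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


# Placeholder workloads (hypothetical): stand-ins for real benchmark queries.
demo_queries = {
    "q1": lambda: sum(range(10_000)),
    "q2": lambda: sorted(range(10_000), reverse=True),
}

demo_timings = time_queries(demo_queries)
for name, seconds in demo_timings.items():
    print(f"{name}: {seconds}")
```

Keeping the best of several runs is one arbitrary choice; reporting every run separately would work just as well with a small change to the loop.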

