Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 13:26:43 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #147: Add DataFusion to h2oai/db-benchmark

alamb opened a new issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147


   *Note*: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11252
   
   I would like to see DataFusion added to h2oai/db-benchmark so that we can see how we compare to other solutions (including Pandas, Spark, cuDF, and Polars).
   
   Since Polars (another Rust DataFrame library that uses Arrow) has already been added, I am hoping that we can learn from their scripts.
   
   There is an issue filed against db-benchmark for adding DataFusion:
   
   https://github.com/h2oai/db-benchmark/issues/107


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1002019659


   > > For the avoidance of doubt, do we all agree that the Python solution will be the only solution submitted (at least for now)?
   > 
   > I think so -- and if we improve the DataFusion Python bindings in the process, so much the better 👍 
   
   Sounds good to me as well. We can always add the native Rust version later.





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1005146595


   @realno thanks for the context. 
   
   You can find the PR that I am working on here (https://github.com/h2oai/db-benchmark/pull/240)
   
   You will need to download the data (directions here https://github.com/h2oai/db-benchmark#single-solution-benchmark) and add it to a data directory in the repo.
   
   Within the PR you will see the scripts `datafusion/groupby-datafusion.py` and `datafusion/join-datafusion.py`.
   
   To run the benchmarks you can do the following (of course you'll have to install datafusion with pip):
   
   groupby
   ```
   SRC_DATANAME=G1_1e7_1e2_0_0 python datafusion/groupby-datafusion.py
   ```
   
   join
   ```
   SRC_DATANAME=J1_1e7_NA_0_0 python datafusion/join-datafusion.py
   ```
   
   Hope this helps! Let me know if you have any other questions.
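
The two commands above could be wrapped in a small driver. This is only a sketch under a few assumptions: it is run from the root of the db-benchmark checkout, the data has already been downloaded, and `datafusion` is installed. The helper names `build_run` and `run_all` are hypothetical and are not part of the PR.

```python
import os
import subprocess
import sys

# Script paths and dataset names are the ones quoted in the comment above.
BENCHMARKS = {
    "groupby": ("datafusion/groupby-datafusion.py", "G1_1e7_1e2_0_0"),
    "join": ("datafusion/join-datafusion.py", "J1_1e7_NA_0_0"),
}


def build_run(task):
    """Return (command, environment) for one benchmark task (hypothetical helper)."""
    script, dataname = BENCHMARKS[task]
    # Copy the current environment and set SRC_DATANAME, as the shell commands do.
    env = dict(os.environ, SRC_DATANAME=dataname)
    return [sys.executable, script], env


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for task in BENCHMARKS:
        cmd, env = build_run(task)
        print(f"SRC_DATANAME={env['SRC_DATANAME']} {' '.join(cmd)}")
        if not dry_run:
            subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Calling `run_all(dry_run=False)` would actually launch both scripts in sequence and stop at the first failure.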





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001104385


   Went through the code and found the math functions that are supported. I think it would be nice to add documentation to the DataFusion site on what's supported. But of course that is a separate topic - I've created an issue for it.
   
   I think for the more advanced group by queries we'll need to add median, standard deviation, and correlation functions. I've created an issue for adding those as well - but hopefully we can submit the benchmark without those and add them when the functionality is added.
   
   I think we'll be able to add query 8; going to work on that.





[GitHub] [arrow-datafusion] ritchie46 commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
ritchie46 commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001896583


   > I believe @ritchie46 has all these enabled in his python bindings
   
   Yes, except for a specific target CPU of course. But LTO and SIMD work fine.





[GitHub] [arrow-datafusion] alamb commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1002017427


   > For the avoidance of doubt, do we all agree that the Python solution will be the only solution submitted (at least for now)?
   
   I think so -- and if we improve the DataFusion Python bindings in the process, so much the better 👍 





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1047420945


   Cross post from slack:
   
   I'm working on updating DataFusion's db-benchmark results based on DataFusion v7. I just got a first cut of the results compared to what I produced a couple of months ago. I was planning on finalizing the analysis before sharing, but I wanted to provide a preview as I may not have time to finish for a day or two. This was produced using datafusion-python on an M1 MacBook.
   
   On December 27th we were at the below for group by:
   ```
   0.11225258399999993 # q1
   0.695109333 # q2
   2.932470125 # q3
   0.07341450000000016 # q4
   3.3075385419999996 # q5
   2.9051008750000005 # q7
   4.573697916 # q8
   68.875322208 # q10
   ```
   
   Based on DataFusion version 7:
   ```
   q1: 0.03743266599999995
   q2: 0.4997687500000001
   q3: 2.119365208
   q4: 0.034825500000000176
   q5: 2.144292417
   q7: 2.0165450419999997
   q8: 2.9783209999999993
   q10: 47.229685542
   ```
   
   We've seen pretty good performance increases across the board based on the latest release. Compared to the currently published db-benchmark results, that would put DataFusion as the fastest / tied for fastest on groupby queries Q1 and Q4. In general, we had similar results to Spark.
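
To put a number on "across the board", the per-query speedup can be computed directly from the two group-by listings above. This is a small sketch; it assumes both listings are in the same unit (presumably seconds) and come from comparable runs.

```python
# Group-by timings copied from the two listings above (unit assumed to be seconds).
december = {
    "q1": 0.11225258399999993,
    "q2": 0.695109333,
    "q3": 2.932470125,
    "q4": 0.07341450000000016,
    "q5": 3.3075385419999996,
    "q7": 2.9051008750000005,
    "q8": 4.573697916,
    "q10": 68.875322208,
}
v7 = {
    "q1": 0.03743266599999995,
    "q2": 0.4997687500000001,
    "q3": 2.119365208,
    "q4": 0.034825500000000176,
    "q5": 2.144292417,
    "q7": 2.0165450419999997,
    "q8": 2.9783209999999993,
    "q10": 47.229685542,
}

# Speedup > 1 means the v7 run was faster than the December run.
speedup = {query: december[query] / v7[query] for query in december}

for query, factor in speedup.items():
    print(f"{query}: {factor:.2f}x")
```

Every ratio comes out above 1, with q1 and q4 showing the largest gains.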
   
   For join in December we had:
   ```
   q1 took 261 ms
   q2 took 367 ms
   q3 took 334 ms
   q4 took 507 ms
   q5 took 1936 ms
   ```
   
   and now we are at:
   ```
   q1: 0.5796001249999999
   q2: 0.4178434580000001
   q3: 0.4701954159999999
   q4: 0.4357888750000001
   q5: 1.8161980410000003
   ```
   We have lost some performance on the join side; I'm not sure why. But compared to other engines we are still doing very well, with basically the best performance across the board.
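
Because the December join numbers are in milliseconds and the new ones look like seconds, a per-query ratio makes the comparison easier to read. This is a sketch only; the unit of the v7 figures is an assumption, as is the comparability of the two runs.

```python
# December join timings, in milliseconds, as listed above.
december_ms = {"q1": 261, "q2": 367, "q3": 334, "q4": 507, "q5": 1936}

# v7 join timings as listed above; assumed to be seconds.
v7_s = {
    "q1": 0.5796001249999999,
    "q2": 0.4178434580000001,
    "q3": 0.4701954159999999,
    "q4": 0.4357888750000001,
    "q5": 1.8161980410000003,
}

# Ratio > 1 means the v7 run took longer than the December run.
ratio = {query: v7_s[query] / (december_ms[query] / 1000.0) for query in december_ms}

slower = [query for query, r in ratio.items() if r > 1.0]
faster = [query for query, r in ratio.items() if r < 1.0]

for query, r in ratio.items():
    print(f"{query}: {r:.2f}x the December time")
print("slower:", slower, "faster:", faster)
```

Under those assumptions the slowdown is concentrated in q1 to q3, while q4 and q5 are slightly faster than before.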
   
   Please take these results as preliminary… I'm still working through things.
   
   I'm going to work on adding the missing group by queries now with the latest v7 functionality. I was also thinking of contributing a script that would run the whole db-benchmark process so that anyone could run db-benchmark as needed.





[GitHub] [arrow-datafusion] jon-chuang commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
jon-chuang commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001031108


   @matthewmturner I'm assuming that's on 0.5 GB?





[GitHub] [arrow-datafusion] alamb commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001181171


   > I think for the more advanced group by queries we'll need to add median, standard deviation, and correlation functions. I've created an issue for adding those as well - but hopefully we can submit the benchmark without those and add them when the functionality is added.
   
   FWIW standard deviation and correlation can be calculated using the existing aggregation functions (e.g. `AVG(X)` and `AVG(X^2)`), numerical precision issues notwithstanding
   
   Median is harder -- I think it will need special casing as it can't be calculated using partial aggregates
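
The difference between the two cases can be illustrated in plain Python. This is a sketch, not DataFusion code: the partition layout and the helper names are made up. Standard deviation only needs a count, a sum and a sum of squares per partition, and those merge by addition. Correlation follows the same pattern with a sum of `X*Y` added. Combining per-partition medians, on the other hand, does not give the overall median.

```python
import math
import statistics


def partial_state(values):
    """Per-partition state: (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def merge(a, b):
    """Partial states combine by element-wise addition."""
    return tuple(x + y for x, y in zip(a, b))


def stddev_from_state(state):
    """Population standard deviation as sqrt(AVG(x*x) - AVG(x)^2)."""
    n, total, total_sq = state
    mean = total / n
    # Clamp at zero: rounding can make the difference slightly negative.
    return math.sqrt(max(total_sq / n - mean * mean, 0.0))


# Hypothetical data split across three partitions.
partitions = [[1.0, 2.0, 3.0], [10.0, 20.0], [4.0, 5.0, 6.0, 7.0]]
all_values = [v for part in partitions for v in part]

state = (0, 0.0, 0.0)
for part in partitions:
    state = merge(state, partial_state(part))

merged_std = stddev_from_state(state)
exact_std = statistics.pstdev(all_values)

# The median has no such mergeable state: the median of the
# per-partition medians is not the median of all values.
median_of_medians = statistics.median([statistics.median(p) for p in partitions])
true_median = statistics.median(all_values)

print(merged_std, exact_std)
print(median_of_medians, true_median)
```

The `AVG(x*x) - AVG(x)^2` form is also where the precision caveat comes from: when the mean is large relative to the spread, the subtraction cancels most of the significant digits.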





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1000970995


   @alamb @houqp FYI I am picking up the work @Dandandan started on this.
   
   Below are the current results I have after adding join queries:
   ```
   group by
   q1 took 56 ms
   q2 took 289 ms
   q3 took 1305 ms
   q4 took 69 ms
   q5 took 1158 ms
   q7 took 1198 ms
   q10 took 24691 ms
   
   join
   q1 took 261 ms
   q2 took 367 ms
   q3 took 334 ms
   q4 took 507 ms
   q5 took 1936 ms
   ```





[GitHub] [arrow-datafusion] realno edited a comment on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
realno edited a comment on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1048408377


   > @realno FYI see above for latest - as i know you expressed a specific interest in this.
   
   I am actually reading this and the slack channel, lol :D Thanks @matthewmturner for the update and progress! I will definitely participate in optimization as much as I can. This is a great start! 





[GitHub] [arrow-datafusion] bkmgit commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
bkmgit commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1002034225


   It would be great to compare the different language bindings. What obstacles are there to submitting Rust?





[GitHub] [arrow-datafusion] alippai commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
alippai commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001780011


   I believe @ritchie46 has all these enabled in his python bindings





[GitHub] [arrow-datafusion] alamb commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001559079


   > im also curious what people think about using the python bindings instead of rust? there are existing utilities we could leverage if so. however, can we generate an optimized build for python like we are in rust?
   
   
   In general, I think using the Python bindings is a great idea for integration of DataFusion with other systems. I don't know very much about how to build / use them, but I would love to see more documentation on the process :)





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1048397965


   Another cross post from slack
   
   This is the latest with the new groupby queries (6 and 9) included:
   ```
   q1: 0.038001750000000056
   q2: 0.46213770899999995
   q3: 2.206588334
   q4: 0.03716179199999958
   q5: 2.2481447910000005
   q6: 2.099691 NEW
   q7: 1.9977297499999995
   q8: 3.0949106670000006
   q9: 2.20049575 NEW
   q10: 49.882744625
   ```
   Only about half (5) of the engines benchmarked even complete these, so that already puts us in a pretty good spot. However, of those that do complete them, we are on the slower side - about tied with the current slowest.
   However, as @Dandandan noted, this is without some optimizations in place. I'm going to work on adding those next.
   
   Also, I'm going to work on building an automation script so anyone can run this benchmark themselves and play with optimizations they have in mind.





[GitHub] [arrow-datafusion] realno commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
realno commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1048447500


   > @realno oh and I had an issue with the approx_median function. I ended up having to use the quantile function instead. I didn't really get the chance to look into it though.
   
   I will look into it. If you can gather some info and create an issue please tag me on it. Thanks! 
   





[GitHub] [arrow-datafusion] Dandandan commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1000975093


   > @alamb @houqp FYI i am picking up on the work @Dandandan started on this.
   > 
   > below are the current results i have after adding join queries:
   > ```
   > group by
   > q1 took 56 ms
   > q2 took 289 ms
   > q3 took 1305 ms
   > q4 took 69 ms
   > q5 took 1158 ms
   > q7 took 1198 ms
   > q10 took 24691 ms
   > 
   > join
   > q1 took 261 ms
   > q2 took 367 ms
   > q3 took 334 ms
   > q4 took 507 ms
   > q5 took 1936 ms
   > ```
   
   Thank you very much @matthewmturner. From looking at your results and extrapolating a bit from my earlier benchmarking and published results it seems like DF does very well on the join queries. Hopefully we can do some real comparisons later.





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1002154177


   > It would be great to compare the different language bindings. What obstacles are there to submitting Rust?
   
   Nothing too big.
   
   1. Creating a utility that writes the required logging output
   2. Coordination with the h2oai team to get Rust / cargo set up. It's not clear to me how much has been done based on @Dandandan's initial request to get DataFusion added.
   
   Personally, I agree it would be great to add Rust bindings. But I think starting with Python and adding Rust as a second step would still be a good improvement.





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1019666324


   As an update here, the maintainer of the db-benchmark repo has left H2O.ai and directed me to reach out to their support team for assistance. I have raised a ticket and will provide updates here as they come.





[GitHub] [arrow-datafusion] matthewmturner edited a comment on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner edited a comment on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001223018


   I feel a bit silly for asking this, but is the ability to raise a value / column to a power implemented in datafusion?  I've tried the below ways with no luck.  Am I missing something obvious?
   
   ```
   DataFusion CLI v5.1.0
   
   ❯ select 2**2;
   🤔 Invalid statement: sql parser error: Expected end of statement, found: 2
   ```
   
   ```
   DataFusion CLI v5.1.0
   
   ❯ select 2^2;
   NotImplemented("Unsupported SQL binary operator BitwiseXor")
   ❯ select power(2,2);
   Plan("Invalid function 'power'")
   ```
   
   I get the same errors when running on table columns





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001824253


   For the avoidance of doubt, do we all agree that the Python solution will be the only solution submitted (at least for now)?





[GitHub] [arrow-datafusion] realno commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
realno commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1005140930


   I am really interested in the result and thank you @matthewmturner for the work! For some context, I am in the process of proposing and planning the next-gen analytics platform for my org to improve performance and scalability - DataFusion caught my eye in early research. This benchmark will provide important information for decision making and I am happy to help.
   
   One small suggestion - it would be really nice if you could make the test script available in the repo so we can analyze the queries and run/debug them locally. I am new to the project, so if it is already available somewhere I apologize; please give some pointers. Thanks!





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1048406818


   @realno FYI see above for latest - as i know you expressed a specific interest in this.





[GitHub] [arrow-datafusion] Dandandan commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001709133


   > im also curious what people think about using the python bindings instead of rust?  there are existing utilities we could leverage if so.  however, can we generate an optimized build for python like we are in rust?
   > 
   > @houqp @alamb 
   
   Python bindings are also a good idea.
   
   There are a few optimizations that are not possible with the Python bindings, so I would expect it to be a bit slower.
   





[GitHub] [arrow-datafusion] jon-chuang edited a comment on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
jon-chuang edited a comment on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001031108


   @matthewmturner I'm assuming that's on the 0.5 GB bench?
   
   It does seem to be a little lacking for group by in comparison to polars and others.





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001770321


   @Dandandan for my information - which optimizations aren't possible?





[GitHub] [arrow-datafusion] matthewmturner edited a comment on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner edited a comment on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001230463


   I'm also curious what people think about using the Python bindings instead of Rust. There are existing utilities we could leverage if so. However, can we generate an optimized build for Python like we do for Rust?
   
   @houqp @alamb 





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001090956


   @Dandandan my understanding of what I did was set the expected partitions at the `ExecutionContext` / `ExecutionConfig` level and then collect that into record batches from a `DataFrame` with `df.collect_partitioned`, which would take into account the target partitions. That being said, I had to play around with that a lot to get a `MemTable`, so I'm not confident it's doing what I expected. I'll look into it a little more though.
   
   If what I did was incorrect, is there a more idiomatic way to create a `MemTable` with the desired number of partitions?
   
   Separately, I haven't been able to find docs on the supported math functions. Any info you could provide on that?





[GitHub] [arrow-datafusion] alamb commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001558342


   > I feel a bit silly for asking this, but is the ability to raise a value / column to a power implemented in datafusion? I've tried the below ways with no luck. Am I missing something obvious?
   
   I thought it was `pow` or `power` but when I tried to dig around I didn't find it implemented. I filed https://github.com/apache/arrow-datafusion/issues/1493 to track





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001042937


   > @matthewmturner I'm assuming that's on the 0.5 GB bench?
   > 
   > It does seem to be a little lacking for group by in comparison to polars and others.
   
   Yes. I still need to double-check the table setup and queries to make sure it's apples to apples though, so take it with a grain of salt for the moment.





[GitHub] [arrow-datafusion] Dandandan commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001043213


   I also see that the code which used to load data into partitions using `MemTable::load` now loads it into 1 partition (with `try_new`). Not sure if the difference is still big, as we now do round-robin repartitioning of the data, but it would maybe still save a bit.
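
For anyone unfamiliar with the term, round-robin repartitioning can be pictured with the plain-Python sketch below. It only illustrates the idea and is not DataFusion's implementation; `round_robin_partition` and the string stand-ins for record batches are hypothetical names made up for this example.

```python
def round_robin_partition(batches, num_partitions):
    """Distribute batches across partitions in round-robin order.

    Hypothetical helper for illustration only; not a DataFusion API.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    partitions = [[] for _ in range(num_partitions)]
    for index, batch in enumerate(batches):
        # Batch 0 goes to partition 0, batch 1 to partition 1, and so on,
        # wrapping around once every partition has received one batch.
        partitions[index % num_partitions].append(batch)
    return partitions


# Strings stand in for record batches here.
batches = [f"batch-{i}" for i in range(7)]
partitions = round_robin_partition(batches, 3)
for number, contents in enumerate(partitions):
    print(number, contents)
```

Loading the data into several partitions up front would presumably skip this redistribution step, which is where the small saving mentioned above would come from.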





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001230463


   I'm also curious what people think about using the Python bindings instead of Rust.  Can we generate an optimized build for Python like we do in Rust?
   
   @houqp @alamb 





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001190119


   @alamb yes, agreed. The median is needed in the same query as the standard deviation, but I'm going to try adding the one with correlation.





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1048445246


   @realno oh and I had an issue with the approx_median function. I ended up having to use the quantile function instead. I didn't really get the chance to look into it though. 
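
The substitution works because the median is simply the 0.5 quantile. A small plain-Python sketch of that equivalence follows; it uses only the standard library `statistics` module (Python 3.8+), not DataFusion, and `median_via_quantile` is a hypothetical name for this example.

```python
import statistics


def median_via_quantile(values):
    """Return the median by asking for the single 0.5 quantile cut point.

    Hypothetical helper for illustration; needs at least two values.
    """
    # n=2 splits the data into two equal-probability groups, so the
    # only cut point returned is the 0.5 quantile.
    return statistics.quantiles(values, n=2)[0]


values = [7, 1, 5, 3, 9, 11]
print(median_via_quantile(values), statistics.median(values))
```

An approximate quantile function will not necessarily match an exact median on large inputs, so results are best compared with a tolerance.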





[GitHub] [arrow-datafusion] Dandandan commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001772086


   > @Dandandan for my information - which optimizations aren't possible?
   
   I meant that, when using the published DataFusion bindings, the following optimizations that the native Rust version uses are not possible, or at least require some more work:
   
   * Compiling for the specific target CPU with `simd` feature enabled
   * Using a custom allocator
   * Full LTO





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001707757


   I finished a first draft of the group by suite using the Python bindings (https://github.com/matthewmturner/db-benchmark/blob/datafusion/datafusion/groupby-datafusion.py)
   
   Feedback welcome :)





[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #147:
URL: https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1001223018


   I feel a bit silly for asking this, but is the ability to raise a value / column to a power implemented in datafusion?  I've tried the below ways with no luck.  Am I missing something obvious?
   
   ```
   DataFusion CLI v5.1.0
   
   ❯ select 2**2;
   🤔 Invalid statement: sql parser error: Expected end of statement, found: 2
   ❯ select 2^2;
   NotImplemented("Unsupported SQL binary operator BitwiseXor")
   ❯ select power(2,2);
   Plan("Invalid function 'power'")
   ```
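
One possible stopgap, assuming only small positive integer exponents are needed, is to expand the power into repeated multiplication so that the query uses nothing beyond the `*` operator. The sketch below builds such an expression string in plain Python; `expand_integer_power` is a hypothetical helper made up for this example and is not part of DataFusion or its bindings.

```python
def expand_integer_power(expr, exponent):
    """Rewrite expr raised to a positive integer power as repeated multiplication.

    Hypothetical helper for illustration only; not a DataFusion API.
    """
    if not isinstance(exponent, int) or exponent < 1:
        raise ValueError("exponent must be a positive integer")
    # Parenthesise each factor so compound expressions keep their meaning.
    factors = [f"({expr})"] * exponent
    return "(" + " * ".join(factors) + ")"


# Builds the text of a query that squares the literal 2.
query = "select " + expand_integer_power("2", 2) + ";"
print(query)
```

This does not cover fractional or negative exponents, which would still need a real power function.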
   

