Posted to commits@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/03 16:16:30 UTC

[GitHub] [arrow-site] jorisvandenbossche edited a comment on pull request #168: Cross-posted blog post with DuckDB

jorisvandenbossche edited a comment on pull request #168:
URL: https://github.com/apache/arrow-site/pull/168#issuecomment-985642092


   Nice post!
   
   I don't have time for a detailed look right now, but some quick feedback on the benchmarks:
   
   * Since you are running benchmarks and showing timings, I think it's best to say what kind of machine they were run on.
   * It would be especially good to mention whether duckdb was processing in parallel or not (and if so, how many cores the benchmarking machine has). The pandas part will always be single core, so the number of cores can easily influence the difference in timing (to be clear, it's of course still a feature of duckdb that it _does_ use those cores, but knowing this helps in interpreting the results).
   * Personally I would also mention that you can do some of the projection / filter pushdown with pandas as well, but manually in the read_parquet call. I know the point is that duckdb nicely manages all of this for you automatically (so you don't have to think about it as a user), but I think it comes across as a bit more honest to at least "acknowledge" the possibility for pandas+pyarrow.

